Import and warehouse data
!pip install lucifer-ml
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.model_selection import train_test_split
import plotly.graph_objects as go
import plotly.express as px
from luciferml.supervised import classification as cls
Car = pd.read_json('Part1 - Car-Attributes.json')
print(Car.shape)
Car.head()
(398, 8)
| | mpg | cyl | disp | hp | wt | acc | yr | origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |
Car1 = pd.read_csv('Part1 - Car name.csv')
print(Car1.shape)
Car1.head()
(398, 1)
| | car_name |
|---|---|
| 0 | chevrolet chevelle malibu |
| 1 | buick skylark 320 |
| 2 | plymouth satellite |
| 3 | amc rebel sst |
| 4 | ford torino |
Automobile = pd.concat([Car,Car1],axis=1)
print(Automobile.shape)
Automobile.head()
(398, 9)
| | mpg | cyl | disp | hp | wt | acc | yr | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
Automobile.to_csv('mpg.csv', index=False)
Automobile.to_excel('mpg.xlsx', index = False)
Automobile.to_json('mpg.json', orient='split', compression='infer', index=True)
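A quick round-trip check confirms the warehoused copy reloads intact (a minimal sketch; the expected shape matches the concatenated frame above):
check = pd.read_json('mpg.json', orient='split')
print(check.shape)  # expect (398, 9)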
Automobile.dtypes
mpg         float64
cyl           int64
disp        float64
hp           object
wt            int64
acc         float64
yr            int64
origin        int64
car_name     object
dtype: object
print(Automobile.isnull().sum())
mpg         0
cyl         0
disp        0
hp          0
wt          0
acc         0
yr          0
origin      0
car_name    0
dtype: int64
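isnull() reports nothing missing, yet hp is typed object: this dataset stores unknown horsepower as the string '?', which pandas does not count as null. A minimal sketch to surface those hidden gaps before modelling:
hp_numeric = pd.to_numeric(Automobile['hp'], errors='coerce')  # '?' becomes NaN
print(hp_numeric.isnull().sum())  # 6 rows carry '?' (handled in the clustering section below)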
Data cleansing, analysis and visualisation
Automobile.describe().T.style \
    .bar(subset=['mean'], color='Reds') \
    .background_gradient(subset=['std'], cmap='ocean') \
    .background_gradient(subset=['50%'], cmap='PuBu')
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| mpg | 398.000000 | 23.514573 | 7.815984 | 9.000000 | 17.500000 | 23.000000 | 29.000000 | 46.600000 |
| cyl | 398.000000 | 5.454774 | 1.701004 | 3.000000 | 4.000000 | 4.000000 | 8.000000 | 8.000000 |
| disp | 398.000000 | 193.425879 | 104.269838 | 68.000000 | 104.250000 | 148.500000 | 262.000000 | 455.000000 |
| wt | 398.000000 | 2970.424623 | 846.841774 | 1613.000000 | 2223.750000 | 2803.500000 | 3608.000000 | 5140.000000 |
| acc | 398.000000 | 15.568090 | 2.757689 | 8.000000 | 13.825000 | 15.500000 | 17.175000 | 24.800000 |
| yr | 398.000000 | 76.010050 | 3.697627 | 70.000000 | 73.000000 | 76.000000 | 79.000000 | 82.000000 |
| origin | 398.000000 | 1.572864 | 0.802055 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 3.000000 |
def boxhistplot(column, data):
    # Interactive histogram and box plot for one column, coloured by car name
    fig = px.histogram(data, x=column, color='car_name')
    fig.show()
    fig2 = px.box(data, x=column, color='car_name')
    fig2.show()

col = ['mpg', 'cyl', 'disp', 'wt', 'acc', 'yr', 'origin', 'hp']
for column in col:
    boxhistplot(column, Automobile)
sns.heatmap(Automobile.corr(), annot=True, cmap="flag")
<AxesSubplot:>
Automobile_num=Automobile.drop(['cyl','yr','origin','mpg','car_name'],axis=1)
Automobile_num.head()
| | disp | hp | wt | acc |
|---|---|---|---|---|
| 0 | 307.0 | 130 | 3504 | 12.0 |
| 1 | 350.0 | 165 | 3693 | 11.5 |
| 2 | 318.0 | 150 | 3436 | 11.0 |
| 3 | 304.0 | 150 | 3433 | 12.0 |
| 4 | 302.0 | 140 | 3449 | 10.5 |
Automobile_num.hist(bins = 20, figsize = (10, 8), color = 'Black')
plt.show()
from sklearn.linear_model import LinearRegression
from scipy import stats
from scipy.stats import zscore
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import warnings

%matplotlib inline
sns.set(color_codes=True)
warnings.filterwarnings('ignore')
Automobile_num.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   disp    398 non-null    float64
 1   hp      398 non-null    object
 2   wt      398 non-null    int64
 3   acc     398 non-null    float64
dtypes: float64(2), int64(1), object(1)
memory usage: 12.6+ KB
sns.pairplot(Automobile, diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x21f094591c0>
Machine learning
K-Means
pip install h2o
import h2o
from h2o.estimators import H2OKMeansEstimator
h2o.init(strict_version_check=False, url="http://192.168.59.147:54321")
Checking whether there is an H2O instance running at http://192.168.59.147:54321 ..... not found.
Attempting to start a local H2O server...
  Java HotSpot(TM) 64-Bit Server VM (build 25.311-b11, mixed mode)
  Starting server from C:\Users\Admin\anaconda3\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\Admin\AppData\Local\Temp\tmps5cutt0a
  JVM stdout: C:\Users\Admin\AppData\Local\Temp\tmps5cutt0a\h2o_Admin_started_from_python.out
  JVM stderr: C:\Users\Admin\AppData\Local\Temp\tmps5cutt0a\h2o_Admin_started_from_python.err
  Server is running at http://127.0.0.1:54323
Connecting to H2O server at http://127.0.0.1:54323 ... successful.
| Property | Value |
|---|---|
| H2O_cluster_uptime: | 06 secs |
| H2O_cluster_timezone: | Asia/Kolkata |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.34.0.3 |
| H2O_cluster_version_age: | 1 month and 9 days |
| H2O_cluster_name: | H2O_from_python_Admin_bjvca0 |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 1.689 Gb |
| H2O_cluster_total_cores: | 8 |
| H2O_cluster_allowed_cores: | 8 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://127.0.0.1:54323 |
| H2O_connection_proxy: | {"http": null, "https": null} |
| H2O_internal_security: | False |
| H2O_API_Extensions: | Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 |
| Python_version: | 3.8.8 final |
Automobile_Num = Automobile.iloc[:,0:7]
Automobile_Num.head()
Automobile_Num.dtypes
Automobile_num2 = Automobile_Num.drop(['hp'],axis=1)
Automobile_Num_z1 = Automobile_num2.apply(zscore)
Automobile_Num_z1.head()
| | mpg | cyl | disp | wt | acc | yr |
|---|---|---|---|---|---|---|
| 0 | -0.706439 | 1.498191 | 1.090604 | 0.630870 | -1.295498 | -1.627426 |
| 1 | -1.090751 | 1.498191 | 1.503514 | 0.854333 | -1.477038 | -1.627426 |
| 2 | -0.706439 | 1.498191 | 1.196232 | 0.550470 | -1.658577 | -1.627426 |
| 3 | -0.962647 | 1.498191 | 1.061796 | 0.546923 | -1.295498 | -1.627426 |
| 4 | -0.834543 | 1.498191 | 1.042591 | 0.565841 | -1.840117 | -1.627426 |
wss =[]
for i in range(1,5):
KM = KMeans(n_clusters=i)
KM.fit(Automobile_Num_z1)
wss.append(KM.inertia_)
wss
plt.plot(range(1,5), wss);
plt.title('Elbow Method');
plt.xlabel("Number of Clusters")
plt.ylabel("WSS");
k_means = KMeans(n_clusters = 2)
k_means.fit(Automobile_Num_z1)
labels = k_means.labels_
silhouette_score(Automobile_Num_z1,labels)
0.4395212407688038
kmeans_kwargs = {
"init": "random",
"n_init": 10,
"max_iter": 300,
"random_state": 42,
}
silhouette_coefficients = []
for k in range(2, 7):
kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
kmeans.fit(Automobile_Num_z1)
score = silhouette_score(Automobile_Num_z1,kmeans.labels_)
silhouette_coefficients.append(score)
plt.plot(range(2, 7), silhouette_coefficients)
plt.xticks(range(2, 7))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
from timeit import default_timer as timer
from datetime import timedelta

# Baseline: two back-to-back timer calls, showing the timer's own overhead
start = timer()
end = timer()
print("Time:", timedelta(seconds=end-start))
Time: 0:00:00.000049
dataset_h2o = h2o.H2OFrame(Automobile_num2)
h2o_km = H2OKMeansEstimator(k=2, init="furthest", standardize=True)
start = timer()
h2o_km.train(training_frame=dataset_h2o)
end = timer()
user_points = h2o.H2OFrame(h2o_km.centers())
h2o_km.show()
time_km = timedelta(seconds=end-start)
print("Time:", time_km)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
kmeans Model Build progress: |███████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OKMeansEstimator : K-means
Model Key: KMeans_model_python_1637083921092_1
Model Summary:
| | number_of_rows | number_of_clusters | number_of_categorical_columns | number_of_iterations | within_cluster_sum_of_squares | total_sum_of_squares | between_cluster_sum_of_squares |
|---|---|---|---|---|---|---|---|
| 0 | 398.0 | 2.0 | 0.0 | 4.0 | 1130.875093 | 2382.0 | 1251.124907 |
ModelMetricsClustering: kmeans
** Reported on train data. **
MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 1130.8750923291825
Total Sum of Square Error to Grand Mean: 2382.000012600604
Between Cluster Sum of Square Error: 1251.1249202714214
Centroid Statistics:
| | centroid | size | within_cluster_sum_of_squares |
|---|---|---|---|
| 0 | 1.0 | 231.0 | 632.391393 |
| 1 | 2.0 | 167.0 | 498.483699 |
Scoring History:
| | timestamp | duration | iterations | number_of_reassigned_observations | within_cluster_sum_of_squares |
|---|---|---|---|---|---|
| 0 | 2021-11-16 23:02:21 | 0.134 sec | 0.0 | NaN | NaN |
| 1 | 2021-11-16 23:02:22 | 0.400 sec | 1.0 | 398.0 | 2058.747250 |
| 2 | 2021-11-16 23:02:22 | 0.439 sec | 2.0 | 23.0 | 1140.428174 |
| 3 | 2021-11-16 23:02:22 | 0.447 sec | 3.0 | 2.0 | 1130.895393 |
| 4 | 2021-11-16 23:02:22 | 0.454 sec | 4.0 | 0.0 | 1130.875093 |
Time: 0:00:01.034527
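A sanity check on the summary above: the total sum of squares splits exactly into its within- and between-cluster parts, 1130.875 + 1251.125 = 2382.0, and 2382 = 397 × 6 = (n − 1) · p, which is what standardizing the six features with sample variance predicts for the total.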
h2o_km_co = H2OKMeansEstimator(k=2, user_points=user_points, cluster_size_constraints=[200, 150], standardize=True)
start = timer()
h2o_km_co.train(training_frame=dataset_h2o)
end = timer()
h2o_km_co.show()
time_km_co = timedelta(seconds=end-start)
print("Time:", time_km_co)
kmeans Model Build progress: |███████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OKMeansEstimator : K-means
Model Key: KMeans_model_python_1637083921092_2
Model Summary:
| | number_of_rows | number_of_clusters | number_of_categorical_columns | number_of_iterations | within_cluster_sum_of_squares | total_sum_of_squares | between_cluster_sum_of_squares |
|---|---|---|---|---|---|---|---|
| 0 | 398.0 | 2.0 | 0.0 | 2.0 | 1130.875093 | 2382.0 | 1251.124907 |
ModelMetricsClustering: kmeans
** Reported on train data. **
MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 1130.8750927223805
Total Sum of Square Error to Grand Mean: 2382.000000000007
Between Cluster Sum of Square Error: 1251.1249072776263
Centroid Statistics:
| | centroid | size | within_cluster_sum_of_squares |
|---|---|---|---|
| 0 | 1.0 | 231.0 | 632.391397 |
| 1 | 2.0 | 167.0 | 498.483695 |
Scoring History:
| | timestamp | duration | iterations | number_of_reassigned_observations | within_cluster_sum_of_squares |
|---|---|---|---|---|---|
| 0 | 2021-11-16 23:02:23 | 0.000 sec | 0.0 | NaN | NaN |
| 1 | 2021-11-16 23:02:24 | 1.615 sec | 1.0 | 398.0 | 1130.875093 |
| 2 | 2021-11-16 23:02:25 | 2.359 sec | 2.0 | 0.0 | 1130.875093 |
Time: 0:00:02.540205
dataset = pd.read_csv('mpg.csv')
print(dataset.shape)
dataset.head()
(398, 9)
| | mpg | cyl | disp | hp | wt | acc | yr | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
dataset_h2o = h2o.H2OFrame(dataset)
h2o_km = H2OKMeansEstimator(k=2, init="furthest", standardize=True)
start = timer()
h2o_km.train(training_frame=dataset_h2o)
end = timer()
user_points = h2o.H2OFrame(h2o_km.centers())
h2o_km.show()
time_km = timedelta(seconds=end-start)
print("Time:", time_km)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
kmeans Model Build progress: |███████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OKMeansEstimator : K-means
Model Key: KMeans_model_python_1637083921092_3
Model Summary:
| | number_of_rows | number_of_clusters | number_of_categorical_columns | number_of_iterations | within_cluster_sum_of_squares | total_sum_of_squares | between_cluster_sum_of_squares |
|---|---|---|---|---|---|---|---|
| 0 | 398.0 | 2.0 | 1.0 | 9.0 | 1985.139986 | 3562.0 | 1576.860014 |
ModelMetricsClustering: kmeans
** Reported on train data. **
MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 1985.139993914968
Total Sum of Square Error to Grand Mean: 3562.00002311023
Between Cluster Sum of Square Error: 1576.8600291952619
Centroid Statistics:
| | centroid | size | within_cluster_sum_of_squares |
|---|---|---|---|
| 0 | 1.0 | 274.0 | 1496.311194 |
| 1 | 2.0 | 124.0 | 488.828800 |
Scoring History:
| | timestamp | duration | iterations | number_of_reassigned_observations | within_cluster_sum_of_squares |
|---|---|---|---|---|---|
| 0 | 2021-11-16 23:02:26 | 0.004 sec | 0.0 | NaN | NaN |
| 1 | 2021-11-16 23:02:26 | 0.033 sec | 1.0 | 398.0 | 3648.147094 |
| 2 | 2021-11-16 23:02:26 | 0.037 sec | 2.0 | 45.0 | 2172.341607 |
| 3 | 2021-11-16 23:02:26 | 0.041 sec | 3.0 | 8.0 | 2008.852715 |
| 4 | 2021-11-16 23:02:26 | 0.044 sec | 4.0 | 6.0 | 2001.856427 |
| 5 | 2021-11-16 23:02:26 | 0.047 sec | 5.0 | 6.0 | 1995.861084 |
| 6 | 2021-11-16 23:02:26 | 0.050 sec | 6.0 | 5.0 | 1988.416527 |
| 7 | 2021-11-16 23:02:26 | 0.053 sec | 7.0 | 2.0 | 1985.647628 |
| 8 | 2021-11-16 23:02:26 | 0.055 sec | 8.0 | 1.0 | 1985.208974 |
| 9 | 2021-11-16 23:02:26 | 0.057 sec | 9.0 | 0.0 | 1985.139986 |
Time: 0:00:00.269699
dataset.columns = ["mpg","cyl","disp","hp","wt","acc","yr","origin","car_name"]
# No row in this dataset has car_name == "noise", so this relabelling is a no-op here
dataset.loc[dataset["car_name"] == "noise", "car_name"] = 9
dataset["car_name"] = dataset["car_name"].astype("category")
dataset
| | mpg | cyl | disp | hp | wt | acc | yr | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl |
| 394 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup |
| 395 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage |
| 396 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger |
| 397 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 82 | 1 | chevy s-10 |
398 rows × 9 columns
groups = dataset.groupby('car_name')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.mpg, group.hp, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Original Car dataset", fontsize=20)
ax.legend(numpoints=1)
<matplotlib.legend.Legend at 0x21f0c511580>
data_h2o_dataset = h2o.H2OFrame(dataset)
h2o_km_dataset = H2OKMeansEstimator(k=2, init="furthest", standardize=True)
start = timer()
h2o_km_dataset.train(x=["mpg", "hp"], training_frame=data_h2o_dataset)
end = timer()
user_points = h2o.H2OFrame(h2o_km_dataset.centers())
h2o_km_dataset.show()
print("Time:", timedelta(seconds=end-start))
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
kmeans Model Build progress: |███████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OKMeansEstimator : K-means
Model Key: KMeans_model_python_1637083921092_4
Model Summary:
| | number_of_rows | number_of_clusters | number_of_categorical_columns | number_of_iterations | within_cluster_sum_of_squares | total_sum_of_squares | between_cluster_sum_of_squares |
|---|---|---|---|---|---|---|---|
| 0 | 398.0 | 2.0 | 0.0 | 9.0 | 325.848475 | 788.0 | 462.151525 |
ModelMetricsClustering: kmeans
** Reported on train data. **
MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 325.8484792231544
Total Sum of Square Error to Grand Mean: 788.0000045834372
Between Cluster Sum of Square Error: 462.1515253602828
Centroid Statistics:
| | centroid | size | within_cluster_sum_of_squares |
|---|---|---|---|
| 0 | 1.0 | 233.0 | 166.269692 |
| 1 | 2.0 | 165.0 | 159.578787 |
Scoring History:
| | timestamp | duration | iterations | number_of_reassigned_observations | within_cluster_sum_of_squares |
|---|---|---|---|---|---|
| 0 | 2021-11-16 23:02:58 | 0.029 sec | 0.0 | NaN | NaN |
| 1 | 2021-11-16 23:02:58 | 0.055 sec | 1.0 | 398.0 | 1735.062601 |
| 2 | 2021-11-16 23:02:58 | 0.058 sec | 2.0 | 5.0 | 329.534171 |
| 3 | 2021-11-16 23:02:58 | 0.061 sec | 3.0 | 7.0 | 328.849956 |
| 4 | 2021-11-16 23:02:58 | 0.066 sec | 4.0 | 6.0 | 327.749765 |
| 5 | 2021-11-16 23:02:58 | 0.070 sec | 5.0 | 4.0 | 326.801268 |
| 6 | 2021-11-16 23:02:58 | 0.073 sec | 6.0 | 4.0 | 326.433332 |
| 7 | 2021-11-16 23:02:58 | 0.076 sec | 7.0 | 3.0 | 326.074668 |
| 8 | 2021-11-16 23:02:58 | 0.080 sec | 8.0 | 1.0 | 325.865553 |
| 9 | 2021-11-16 23:02:58 | 0.082 sec | 9.0 | 0.0 | 325.848475 |
Time: 0:00:00.324790
h2o_km_co_dataset = H2OKMeansEstimator(k=2, user_points=user_points, standardize=True)
start = timer()
h2o_km_co_dataset.train(x=["mpg","hp"], training_frame=data_h2o_dataset)
end = timer()
h2o_km_co_dataset.show()
time_h2o_km_co_dataset = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_dataset)
kmeans Model Build progress: |███████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OKMeansEstimator : K-means
Model Key: KMeans_model_python_1637083921092_5
Model Summary:
| | number_of_rows | number_of_clusters | number_of_categorical_columns | number_of_iterations | within_cluster_sum_of_squares | total_sum_of_squares | between_cluster_sum_of_squares |
|---|---|---|---|---|---|---|---|
| 0 | 398.0 | 2.0 | 0.0 | 2.0 | 325.848475 | 788.0 | 462.151525 |
ModelMetricsClustering: kmeans
** Reported on train data. **
MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 325.8484792231544
Total Sum of Square Error to Grand Mean: 788.0000045834372
Between Cluster Sum of Square Error: 462.1515253602828
Centroid Statistics:
| | centroid | size | within_cluster_sum_of_squares |
|---|---|---|---|
| 0 | 1.0 | 233.0 | 166.269692 |
| 1 | 2.0 | 165.0 | 159.578787 |
Scoring History:
| | timestamp | duration | iterations | number_of_reassigned_observations | within_cluster_sum_of_squares |
|---|---|---|---|---|---|
| 0 | 2021-11-16 23:02:58 | 0.004 sec | 0.0 | NaN | NaN |
| 1 | 2021-11-16 23:02:58 | 0.008 sec | 1.0 | 398.0 | 325.848475 |
| 2 | 2021-11-16 23:02:58 | 0.010 sec | 2.0 | 0.0 | 325.848475 |
Time: 0:00:00.285729
from h2o.estimators.aggregator import H2OAggregatorEstimator
params = { "target_num_exemplars": 350,
"rel_tol_num_exemplars": 0.5,
"categorical_encoding": "eigen"}
agg = H2OAggregatorEstimator(**params)
start = timer()
agg.train(x=["mpg","hp","car_name"], training_frame=data_h2o_dataset)
data_agg_12_dataset = agg.aggregated_frame
h2o_km_co_agg_12_dataset = H2OKMeansEstimator(k=2, user_points=user_points, standardize=True)
h2o_km_co_agg_12_dataset.train(x=["mpg","hp"],training_frame=data_agg_12_dataset)
end = timer()
h2o_km_co_agg_12_dataset.show()
time_h2o_km_co_agg_12_dataset = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_agg_12_dataset)
aggregator Model Build progress: |███████████████████████████████████████████████| (done) 100%
kmeans Model Build progress: |███████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OKMeansEstimator : K-means
Model Key: KMeans_model_python_1637083921092_7
Model Summary:
| | number_of_rows | number_of_clusters | number_of_categorical_columns | number_of_iterations | within_cluster_sum_of_squares | total_sum_of_squares | between_cluster_sum_of_squares |
|---|---|---|---|---|---|---|---|
| 0 | 246.0 | 2.0 | 0.0 | 2.0 | 195.502341 | 487.0 | 291.497659 |
ModelMetricsClustering: kmeans
** Reported on train data. **
MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 195.50234245877655
Total Sum of Square Error to Grand Mean: 487.0000009613941
Between Cluster Sum of Square Error: 291.4976585026176
Centroid Statistics:
| | centroid | size | within_cluster_sum_of_squares |
|---|---|---|---|
| 0 | 1.0 | 143.0 | 98.890552 |
| 1 | 2.0 | 103.0 | 96.611790 |
Scoring History:
| | timestamp | duration | iterations | number_of_reassigned_observations | within_cluster_sum_of_squares |
|---|---|---|---|---|---|
| 0 | 2021-11-16 23:03:00 | 0.009 sec | 0.0 | NaN | NaN |
| 1 | 2021-11-16 23:03:00 | 0.018 sec | 1.0 | 246.0 | 197.493165 |
| 2 | 2021-11-16 23:03:00 | 0.025 sec | 2.0 | 0.0 | 195.502341 |
Time: 0:00:01.347042
groups = dataset.groupby('car_name')
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.mpg, group.hp, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Original Car dataset", fontsize=20)
ax.legend(numpoints=1)
<matplotlib.legend.Legend at 0x21f0e37c370>
data_agg_df_12_dataset = data_agg_12_dataset.as_data_frame()
data_agg_df_12_dataset["car_name"] = data_agg_df_12_dataset["car_name"].astype("category")
groups = data_agg_df_12_dataset.groupby("car_name")
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.mpg, group.hp, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Aggregated Car Dataset", fontsize=20)
ax.legend(numpoints=1)
<matplotlib.legend.Legend at 0x21f101fa370>
data_agg_df_12_dataset.head()
| | mpg | cyl | disp | hp | wt | acc | yr | origin | car_name | counts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu | 2 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite | 2 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst | 4 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino | 3 |
data_agg_df_12_dataset.to_csv('kmeans.csv', index=False)
kmeans_data = pd.read_csv('kmeans.csv')
kmeans_data.head()
| | mpg | cyl | disp | hp | wt | acc | yr | origin | car_name | counts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu | 2 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite | 2 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst | 4 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino | 3 |
dataset["km_pred"] = h2o_km_dataset.predict(data_h2o_dataset).as_data_frame()['predict'].astype("category")
groups = dataset.groupby('km_pred')
fig, ax = plt.subplots(1,1,figsize=(10,12))
for name, group in groups:
ax.plot(group.mpg, group.hp, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Predictions of standard K-means", fontsize=20)
ax.legend(numpoints=1)
kmeans prediction progress: |████████████████████████████████████████████████████| (done) 100%
<matplotlib.legend.Legend at 0x21f0c4bfa90>
Hierarchical clustering
from sklearn import preprocessing
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
Auto = Automobile.iloc[:,0:7]
Auto.head()
| | mpg | cyl | disp | hp | wt | acc | yr |
|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 |
hpIsDigit = pd.DataFrame(Auto.hp.str.isdigit())
Auto[hpIsDigit['hp'] == False]
| | mpg | cyl | disp | hp | wt | acc | yr |
|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | ? | 2046 | 19.0 | 71 |
| 126 | 21.0 | 6 | 200.0 | ? | 2875 | 17.0 | 74 |
| 330 | 40.9 | 4 | 85.0 | ? | 1835 | 17.3 | 80 |
| 336 | 23.6 | 4 | 140.0 | ? | 2905 | 14.3 | 80 |
| 354 | 34.5 | 4 | 100.0 | ? | 2320 | 15.8 | 81 |
| 374 | 23.0 | 4 | 151.0 | ? | 3035 | 20.5 | 82 |
Auto = Auto.replace('?', np.nan)
Auto[hpIsDigit['hp'] == False]
| | mpg | cyl | disp | hp | wt | acc | yr |
|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | NaN | 2046 | 19.0 | 71 |
| 126 | 21.0 | 6 | 200.0 | NaN | 2875 | 17.0 | 74 |
| 330 | 40.9 | 4 | 85.0 | NaN | 1835 | 17.3 | 80 |
| 336 | 23.6 | 4 | 140.0 | NaN | 2905 | 14.3 | 80 |
| 354 | 34.5 | 4 | 100.0 | NaN | 2320 | 15.8 | 81 |
| 374 | 23.0 | 4 | 151.0 | NaN | 3035 | 20.5 | 82 |
Auto.median()
mpg       23.0
cyl        4.0
disp     148.5
hp        93.5
wt      2803.5
acc       15.5
yr        76.0
dtype: float64
Auto['hp'] = pd.to_numeric(Auto['hp'])  # hp was read as object; '?' is already NaN
Auto['hp'].fillna(Auto['hp'].median(), inplace=True)
Auto.isnull().sum()
mpg     0
cyl     0
disp    0
hp      0
wt      0
acc     0
yr      0
dtype: int64
Auto_z = Auto.apply(zscore)
Auto_z.head()
| | mpg | cyl | disp | hp | wt | acc | yr |
|---|---|---|---|---|---|---|---|
| 0 | -0.706439 | 1.498191 | 1.090604 | 0.673118 | 0.630870 | -1.295498 | -1.627426 |
| 1 | -1.090751 | 1.498191 | 1.503514 | 1.589958 | 0.854333 | -1.477038 | -1.627426 |
| 2 | -0.706439 | 1.498191 | 1.196232 | 1.197027 | 0.550470 | -1.658577 | -1.627426 |
| 3 | -0.962647 | 1.498191 | 1.061796 | 1.197027 | 0.546923 | -1.295498 | -1.627426 |
| 4 | -0.834543 | 1.498191 | 1.042591 | 0.935072 | 0.565841 | -1.840117 | -1.627426 |
link_method = linkage(Auto_z.iloc[:,0:7], method = 'average')
plt.figure(figsize=(25, 10))
dendrogram(link_method)
plt.show()
dendrogram(
link_method,
truncate_mode='lastp',
p=2,
)
plt.show()
clusters = fcluster(link_method, 2, criterion='maxclust')
clusters
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2,
2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1,
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2], dtype=int32)
Auto_z['clusters_H'] = clusters
Auto_z.head()
| | mpg | cyl | disp | hp | wt | acc | yr | clusters_H |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.706439 | 1.498191 | 1.090604 | 0.673118 | 0.630870 | -1.295498 | -1.627426 | 1 |
| 1 | -1.090751 | 1.498191 | 1.503514 | 1.589958 | 0.854333 | -1.477038 | -1.627426 | 1 |
| 2 | -0.706439 | 1.498191 | 1.196232 | 1.197027 | 0.550470 | -1.658577 | -1.627426 | 1 |
| 3 | -0.962647 | 1.498191 | 1.061796 | 1.197027 | 0.546923 | -1.295498 | -1.627426 | 1 |
| 4 | -0.834543 | 1.498191 | 1.042591 | 0.935072 | 0.565841 | -1.840117 | -1.627426 | 1 |
Auto_z.clusters_H.value_counts().sort_index()
1    100
2    298
Name: clusters_H, dtype: int64
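For a rough comparison with the K-Means silhouette of 0.44 above, the same score can be computed on the hierarchical labels (a minimal sketch on the z-scored features):
print(silhouette_score(Auto_z.drop('clusters_H', axis=1), clusters))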
Auto['clusters_H'] = clusters
Auto.head()
| | mpg | cyl | disp | hp | wt | acc | yr | clusters_H |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 |
Hclus=Auto
Hclus.head()
| | mpg | cyl | disp | hp | wt | acc | yr | clusters_H |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 |
aggdata=Auto.iloc[:,0:8].groupby('clusters_H').mean()
aggdata['Freq']=Auto.clusters_H.value_counts().sort_index()
aggdata
| | mpg | cyl | disp | hp | wt | acc | yr | Freq |
|---|---|---|---|---|---|---|---|---|
| clusters_H | | | | | | | | |
| 1 | 14.684000 | 7.980000 | 345.470000 | 160.400000 | 4121.560000 | 12.702000 | 73.740000 | 100 |
| 2 | 26.477852 | 4.607383 | 142.404362 | 85.479866 | 2584.137584 | 16.529866 | 76.771812 | 298 |
aggdata.head()
| | mpg | cyl | disp | hp | wt | acc | yr | Freq |
|---|---|---|---|---|---|---|---|---|
| clusters_H | | | | | | | | |
| 1 | 14.684000 | 7.980000 | 345.470000 | 160.400000 | 4121.560000 | 12.702000 | 73.740000 | 100 |
| 2 | 26.477852 | 4.607383 | 142.404362 | 85.479866 | 2584.137584 | 16.529866 | 76.771812 | 298 |
plt.figure(figsize=(10, 8))
sns.scatterplot(x="mpg", y="hp", hue="clusters_H",
data=Auto,
palette=['green','brown']);
Regression with original data
from luciferml.preprocessing import Preprocess as prep

Auto_lin = pd.read_csv('mpg.csv')
Auto_lin = prep.skewcorrect(Auto_lin)
██╗░░░░░██╗░░░██╗░█████╗░██╗███████╗███████╗██████╗░░░░░░░███╗░░░███╗██╗░░░░░
██║░░░░░██║░░░██║██╔══██╗██║██╔════╝██╔════╝██╔══██╗░░░░░░████╗░████║██║░░░░░
██║░░░░░██║░░░██║██║░░╚═╝██║█████╗░░█████╗░░██████╔╝█████╗██╔████╔██║██║░░░░░
██║░░░░░██║░░░██║██║░░██╗██║██╔══╝░░██╔══╝░░██╔══██╗╚════╝██║╚██╔╝██║██║░░░░░
███████╗╚██████╔╝╚█████╔╝██║██║░░░░░███████╗██║░░██║░░░░░░██║░╚═╝░██║███████╗
╚══════╝░╚═════╝░░╚════╝░╚═╝╚═╝░░░░░╚══════╝╚═╝░░╚═╝░░░░░░╚═╝░░░░░╚═╝╚══════╝
Started Preprocessor
Skewness in numerical features:
Skewness
origin 0.920291
disp 0.716930
wt 0.529059
cyl 0.524934
mpg 0.455342
acc 0.277725
yr 0.011491
Per-feature statistics before and after transformation:

| feature | skewness before | skewness after | mean before | mean after | std before | std after |
|---|---|---|---|---|---|---|
| origin | 0.923776 | 0.814287 | 1.572864 | 0.972924 | 0.801047 | 0.330828 |
| disp | 0.719645 | 0.228886 | 193.425879 | 5.129939 | 104.138764 | 0.527424 |
| wt | 0.531063 | 0.156435 | 2970.424623 | 7.957254 | 845.777234 | 0.280214 |
| cyl | 0.526922 | 0.393003 | 5.454774 | 1.831699 | 1.698866 | 0.254277 |
| mpg | 0.457066 | -0.109138 | 23.514573 | 3.147830 | 7.806159 | 0.323759 |
| acc | 0.278777 | -0.315577 | 15.568090 | 2.793445 | 2.754222 | 0.168927 |
| yr | 0.011535 | -0.047701 | 76.010050 | 4.342784 | 3.692978 | 0.048020 |
Elapsed Time: 6.327205181121826 seconds
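The post-transformation means (e.g., wt falls from 2970.4 to about 7.96 ≈ ln 2970) are consistent with a log-style transform on the skewed columns. A minimal sketch of that idea, assuming a simple log1p rule rather than luciferml's exact internals:
from scipy.stats import skew

def reduce_skew(df, threshold=0.5):
    # Log-transform numeric columns whose skewness exceeds the threshold
    out = df.copy()
    for column in out.select_dtypes(include='number').columns:
        if abs(skew(out[column])) > threshold:
            out[column] = np.log1p(out[column])  # log1p is safe for zeros
    return out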
from luciferml.supervised.regression import Regression

feature = ['cyl','disp','wt','yr','acc','hp']
X = Auto_lin[feature]
y = Auto_lin['mpg']
regressor = Regression(predictor='bag')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are Categorical
Encoding Features [*]
Encoding Features Done [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Converting Sparse Features to array []
Conversion of Sparse Features to array Done [ ✓ ]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training Bagging Regressor on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 91.77 %
Validation Mean Absolute Error is : 0.06909115890666576
Validation Root Mean Squared Error is : 0.08999177812968095
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 85.28 %
Standard Deviation: 4.39 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.27361226081848145 seconds
feature = ['cyl','disp','wt','yr','acc','hp']
X = Auto_lin[feature]
y = Auto_lin['mpg']
regressor = Regression(predictor='br')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are Categorical
Encoding Features [*]
Encoding Features Done [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Converting Sparse Features to array []
Conversion of Sparse Features to array Done [ ✓ ]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training BayesianRidge Regressor on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 84.60 %
Validation Mean Absolute Error is : 0.0927877173479569
Validation Root Mean Squared Error is : 0.12313478051750919
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 84.93 %
Standard Deviation: 3.58 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.19059991836547852 seconds
feature = ['cyl','disp','wt','yr','acc','hp']
X = Auto_lin[feature]
y = Auto_lin['mpg']
regressor = Regression(predictor='svr')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are Categorical
Encoding Features [*]
Encoding Features Done [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Converting Sparse Features to array []
Conversion of Sparse Features to array Done [ ✓ ]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training Support Vector Machine on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 80.24 %
Validation Mean Absolute Error is : 0.10662421134349838
Validation Root Mean Squared Error is : 0.13945737350119647
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 71.37 %
Standard Deviation: 6.41 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.08940458297729492 seconds
feature = ['cyl','disp','wt','yr','acc','hp']
X = Auto_lin[feature]
y = Auto_lin['mpg']
regressor = Regression(predictor='cat')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are Categorical
Encoding Features [*]
Encoding Features Done [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Converting Sparse Features to array []
Conversion of Sparse Features to array Done [ ✓ ]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training CatBoost Regressor on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 90.89 %
Validation Mean Absolute Error is : 0.07086485480518286
Validation Root Mean Squared Error is : 0.0946969900419625
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 88.64 %
Standard Deviation: 3.57 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 22.934786558151245 seconds
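The four cells above differ only in the predictor code. A compact equivalent using the same luciferml calls would loop over them:
for pred in ['bag', 'br', 'svr', 'cat']:
    regressor = Regression(predictor=pred)
    regressor.fit(X, y)
    result = regressor.result()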
Regression with hierarchical clusters
feature = ['cyl','disp','wt','yr','acc']
X = Hclus[feature]
y = Hclus['mpg']
regressor = Regression(predictor='bag')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training Bagging Regressor on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 90.34 %
Validation Mean Absolute Error is : 1.6654999999999998
Validation Root Mean Squared Error is : 2.27956081296376
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 83.03 %
Standard Deviation: 5.87 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.30253100395202637 seconds
feature = ['cyl','disp','wt','yr','acc']
X = Hclus[feature]
y = Hclus['mpg']
regressor = Regression(predictor='br')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training BayesianRidge Regressor on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 82.50 %
Validation Mean Absolute Error is : 2.4617986033346395
Validation Root Mean Squared Error is : 3.067077108540655
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 79.17 %
Standard Deviation: 3.60 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.04736685752868652 seconds
feature = ['cyl','disp','wt','yr','acc']
X = Hclus[feature]
y = Hclus['mpg']
regressor = Regression(predictor='svr')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training Support Vector Machine on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 86.08 %
Validation Mean Absolute Error is : 1.851333756066642
Validation Root Mean Squared Error is : 2.735773289092001
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 81.84 %
Standard Deviation: 5.08 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.08737802505493164 seconds
feature = ['cyl','disp','wt','yr','acc']
X = Hclus[feature]
y = Hclus['mpg']
regressor = Regression(predictor='cat')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training CatBoost Regressor on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 90.57 %
Validation Mean Absolute Error is : 1.645309993415522
Validation Root Mean Squared Error is : 2.251301552860863
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 86.66 %
Standard Deviation: 4.68 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 18.270514965057373 seconds
Regression with K-Means clusters
kmeans_data = pd.read_csv('kmeans.csv')
feature = ['cyl','disp','wt','yr','acc']
X = kmeans_data[feature]
y = kmeans_data['mpg']
regressor = Regression(predictor='bag')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training Bagging Regressor on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 81.43 %
Validation Mean Absolute Error is : 2.6346000000000003
Validation Root Mean Squared Error is : 3.823308776439591
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 85.89 %
Standard Deviation: 6.84 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.3941783905029297 seconds
kmeans_data = pd.read_csv('kmeans.csv')
feature = ['cyl','disp','wt','yr','acc']
X = kmeans_data[feature]
y = kmeans_data['mpg']
regressor = Regression(predictor = 'br')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training BayesianRidge Regressor on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 82.88 %
Validation Mean Absolute Error is : 2.934162854816411
Validation Root Mean Squared Error is : 3.6715902504466014
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 80.15 %
Standard Deviation: 4.51 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.04691505432128906 seconds
kmeans_data = pd.read_csv('kmeans.csv')
feature = ['cyl','disp','wt','yr','acc']
X = kmeans_data[feature]
y = kmeans_data['mpg']
regressor = Regression(predictor = 'svr')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training Support Vector Machine on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 79.82 %
Validation Mean Absolute Error is : 2.813372529040631
Validation Root Mean Squared Error is : 3.986389680999501
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 82.64 %
Standard Deviation: 5.60 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.06808638572692871 seconds
kmeans_data = pd.read_csv('kmeans.csv')
feature = ['cyl','disp','wt','yr','acc']
X = kmeans_data[feature]
y = kmeans_data['mpg']
regressor = Regression(predictor = 'cat')
regressor.fit(X, y)
result = regressor.result()
Started Lucifer-ML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are not categorical [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training CatBoost Regressor on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Evaluating Model Performance [*]
Validation R2 Score is 86.63 %
Validation Mean Absolute Error is : 2.2946728466560553
Validation Root Mean Squared Error is : 3.244851620902452
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
R2 Score: 88.83 %
Standard Deviation: 6.33 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 14.162353277206421 seconds
5 & 6. The regression models have been generated with the original data, the hierarchical clusters, and the K-means clusters. The four regression models used in the project are the Bayesian Ridge regressor, the Support Vector regressor, the Bagging regressor, and the CatBoost regressor.
K-means clusters with the CatBoost regressor achieved the highest cross-validated R2 score, 88.83%.
The inference from the data-analysis plots is that miles per gallon has improved over the years, owing to constant innovation in the car industry or to the construction of new roads in the city (which might have reduced traffic).
To improve the dataset, it would be beneficial to include details on fuel prices over the years and on the average kilometres driven.
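The model comparison above can also be scripted instead of run cell by cell. A minimal sketch, assuming the luciferml Regression class used in the earlier cells (the import path mirrors the Classification import shown later); the scores dict is a hypothetical convenience:
from luciferml.supervised.regression import Regression
# Predictor codes used in this project:
# 'br' = Bayesian Ridge, 'svr' = Support Vector, 'bag' = Bagging, 'cat' = CatBoost
scores = {}
for code in ['br', 'svr', 'bag', 'cat']:
    regressor = Regression(predictor = code)
    regressor.fit(X, y)               # X, y built from the clustered dataset as above
    scores[code] = regressor.result()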
from sdv.tabular import GaussianCopula
import pandas as pd
import numpy as np
real_data = pd.read_csv('Company.csv')
import warnings
warnings.filterwarnings("ignore")
model = GaussianCopula()
model.fit(real_data)
synthetic_data = model.sample()
real_data.head()
| | A | B | C | D | Quality |
|---|---|---|---|---|---|
| 0 | 32 | 88 | 98 | 101 | Quality A |
| 1 | 168 | 137 | 167 | 181 | Quality B |
| 2 | 149 | 145 | 154 | 144 | NaN |
| 3 | 96 | 33 | 27 | 31 | Quality A |
| 4 | 13 | 76 | 102 | 94 | Quality A |
synthetic_data.head()
| | A | B | C | D | Quality |
|---|---|---|---|---|---|
| 0 | 197 | 197 | 199 | 195 | Quality B |
| 1 | 34 | 64 | 64 | 14 | Quality A |
| 2 | 157 | 98 | 13 | 50 | Quality B |
| 3 | 155 | 149 | 122 | 155 | Quality B |
| 4 | 15 | 28 | 13 | 4 | Quality A |
from sdv.evaluation import evaluate
evaluate(synthetic_data, real_data)
0.5499799634809122
evaluate(synthetic_data, real_data, aggregate=False)
| | metric | name | raw_score | normalized_score | min_value | max_value | goal |
|---|---|---|---|---|---|---|---|
| 1 | LogisticDetection | LogisticRegression Detection | 0.736032 | 7.360317e-01 | 0.0 | 1.0 | MAXIMIZE |
| 2 | SVCDetection | SVC Detection | 0.555952 | 5.559524e-01 | 0.0 | 1.0 | MAXIMIZE |
| 11 | GMLogLikelihood | GaussianMixture Log Likelihood | -52.198297 | 2.140759e-23 | -inf | inf | MAXIMIZE |
| 12 | CSTest | Chi-Squared | 0.801452 | 8.014521e-01 | 0.0 | 1.0 | MAXIMIZE |
| 13 | KSTest | Inverted Kolmogorov-Smirnov D statistic | 0.860656 | 8.606557e-01 | 0.0 | 1.0 | MAXIMIZE |
| 14 | KSTestExtended | Inverted Kolmogorov-Smirnov D statistic | 0.852459 | 8.524590e-01 | 0.0 | 1.0 | MAXIMIZE |
| 27 | ContinuousKLDivergence | Continuous Kullback–Leibler Divergence | 0.148706 | 1.487055e-01 | 0.0 | 1.0 | MAXIMIZE |
evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'])
0.8310539395438277
from luciferml.supervised.classification import Classification
accuracy_scores = {}
feature = ['A','B','C','D']
X = real_data[feature]
y = real_data['Quality']
classifier = Classification(predictor = 'lr')
classifier.fit(X, y)
result = classifier.result()
Started LuciferML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are Categorical [*]
Encoding Labels
Encoding Labels Done [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training Logistic Regression on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Making Confusion Matrix [*]
[[5 0 0]
[0 2 2]
[0 3 1]]
Confusion Matrix Done [ ✓ ]
Evaluating Model Performance [*]
Validation Accuracy is : 0.6153846153846154
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
Accuracy: 64.00 %
Standard Deviation: 14.28 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.550457239151001 seconds
feature = ['A','B','C','D']
X = synthetic_data[feature]
y = synthetic_data['Quality']
classifier = Classification(predictor = 'lr')
classifier.fit(X, y)
result = classifier.result()
Started LuciferML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are Categorical [*]
Encoding Labels
Encoding Labels Done [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training Logistic Regression on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Making Confusion Matrix [*]
[[5 1]
[1 6]]
Confusion Matrix Done [ ✓ ]
Evaluating Model Performance [*]
Validation Accuracy is : 0.8461538461538461
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
Accuracy: 75.00 %
Standard Deviation: 24.19 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.34400057792663574 seconds
After fitting the GaussianCopula model, the logistic regression classifier scores higher on the synthetic data (validation accuracy 84.6%) than on the real data (61.5%). Both accuracy scores were generated with the same logistic regression setup.
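To keep the real-versus-synthetic comparison symmetric, the two runs can be wrapped in one helper. A minimal sketch, assuming the Classification class imported above; score_frame is a hypothetical helper name:
def score_frame(frame, feature_cols, target):
    # Fit the same logistic regression pipeline on a given dataframe
    clf = Classification(predictor = 'lr')
    clf.fit(frame[feature_cols], frame[target])
    return clf.result()

real_result = score_frame(real_data, ['A', 'B', 'C', 'D'], 'Quality')
synthetic_result = score_frame(synthetic_data, ['A', 'B', 'C', 'D'], 'Quality')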
import h2o
from h2o.estimators import H2OPrincipalComponentAnalysisEstimator
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
| H2O_cluster_uptime: | 25 mins 57 secs |
| H2O_cluster_timezone: | Asia/Kolkata |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.34.0.3 |
| H2O_cluster_version_age: | 1 month and 13 days |
| H2O_cluster_name: | H2O_from_python_Admin_95j8sn |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 1.444 Gb |
| H2O_cluster_total_cores: | 8 |
| H2O_cluster_allowed_cores: | 8 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://localhost:54321 |
| H2O_connection_proxy: | {"http": null, "https": null} |
| H2O_internal_security: | False |
| H2O_API_Extensions: | Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 |
| Python_version: | 3.8.8 final |
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.model_selection import train_test_split
import plotly.graph_objects as go
import plotly.express as px
from luciferml.supervised import classification as cls
Data
gm = pd.read_csv('Part3 - vehicle.csv')
print(gm.shape)
gm.head()
(846, 19)
| | compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95 | 48.0 | 83.0 | 178.0 | 72.0 | 10 | 162.0 | 42.0 | 20.0 | 159 | 176.0 | 379.0 | 184.0 | 70.0 | 6.0 | 16.0 | 187.0 | 197 | van |
| 1 | 91 | 41.0 | 84.0 | 141.0 | 57.0 | 9 | 149.0 | 45.0 | 19.0 | 143 | 170.0 | 330.0 | 158.0 | 72.0 | 9.0 | 14.0 | 189.0 | 199 | van |
| 2 | 104 | 50.0 | 106.0 | 209.0 | 66.0 | 10 | 207.0 | 32.0 | 23.0 | 158 | 223.0 | 635.0 | 220.0 | 73.0 | 14.0 | 9.0 | 188.0 | 196 | car |
| 3 | 93 | 41.0 | 82.0 | 159.0 | 63.0 | 9 | 144.0 | 46.0 | 19.0 | 143 | 160.0 | 309.0 | 127.0 | 63.0 | 6.0 | 10.0 | 199.0 | 207 | van |
| 4 | 85 | 44.0 | 70.0 | 205.0 | 103.0 | 52 | 149.0 | 45.0 | 19.0 | 144 | 241.0 | 325.0 | 188.0 | 127.0 | 9.0 | 11.0 | 180.0 | 183 | bus |
EDA and visualisation
print(gm.isnull().sum())
compactness 0
circularity 5
distance_circularity 4
radius_ratio 6
pr.axis_aspect_ratio 2
max.length_aspect_ratio 0
scatter_ratio 1
elongatedness 1
pr.axis_rectangularity 3
max.length_rectangularity 0
scaled_variance 3
scaled_variance.1 2
scaled_radius_of_gyration 2
scaled_radius_of_gyration.1 4
skewness_about 6
skewness_about.1 1
skewness_about.2 1
hollows_ratio 0
class 0
dtype: int64
for cols in gm.columns:
if(cols != 'class'):
gm[cols] = gm[cols].fillna(gm[cols].median())
print(gm.isnull().sum())
compactness 0
circularity 0
distance_circularity 0
radius_ratio 0
pr.axis_aspect_ratio 0
max.length_aspect_ratio 0
scatter_ratio 0
elongatedness 0
pr.axis_rectangularity 0
max.length_rectangularity 0
scaled_variance 0
scaled_variance.1 0
scaled_radius_of_gyration 0
scaled_radius_of_gyration.1 0
skewness_about 0
skewness_about.1 0
skewness_about.2 0
hollows_ratio 0
class 0
dtype: int64
gm.describe().T.style.bar(
subset=['mean'],
color='Reds').background_gradient(
subset=['std'], cmap='ocean').background_gradient(subset=['50%'], cmap='PuBu')
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| compactness | 846.000000 | 93.678487 | 8.234474 | 73.000000 | 87.000000 | 93.000000 | 100.000000 | 119.000000 |
| circularity | 846.000000 | 44.823877 | 6.134272 | 33.000000 | 40.000000 | 44.000000 | 49.000000 | 59.000000 |
| distance_circularity | 846.000000 | 82.100473 | 15.741569 | 40.000000 | 70.000000 | 80.000000 | 98.000000 | 112.000000 |
| radius_ratio | 846.000000 | 168.874704 | 33.401356 | 104.000000 | 141.000000 | 167.000000 | 195.000000 | 333.000000 |
| pr.axis_aspect_ratio | 846.000000 | 61.677305 | 7.882188 | 47.000000 | 57.000000 | 61.000000 | 65.000000 | 138.000000 |
| max.length_aspect_ratio | 846.000000 | 8.567376 | 4.601217 | 2.000000 | 7.000000 | 8.000000 | 10.000000 | 55.000000 |
| scatter_ratio | 846.000000 | 168.887707 | 33.197710 | 112.000000 | 147.000000 | 157.000000 | 198.000000 | 265.000000 |
| elongatedness | 846.000000 | 40.936170 | 7.811882 | 26.000000 | 33.000000 | 43.000000 | 46.000000 | 61.000000 |
| pr.axis_rectangularity | 846.000000 | 20.580378 | 2.588558 | 17.000000 | 19.000000 | 20.000000 | 23.000000 | 29.000000 |
| max.length_rectangularity | 846.000000 | 147.998818 | 14.515652 | 118.000000 | 137.000000 | 146.000000 | 159.000000 | 188.000000 |
| scaled_variance | 846.000000 | 188.596927 | 31.360427 | 130.000000 | 167.000000 | 179.000000 | 217.000000 | 320.000000 |
| scaled_variance.1 | 846.000000 | 439.314421 | 176.496341 | 184.000000 | 318.250000 | 363.500000 | 586.750000 | 1018.000000 |
| scaled_radius_of_gyration | 846.000000 | 174.706856 | 32.546277 | 109.000000 | 149.000000 | 173.500000 | 198.000000 | 268.000000 |
| scaled_radius_of_gyration.1 | 846.000000 | 72.443262 | 7.468734 | 59.000000 | 67.000000 | 71.500000 | 75.000000 | 135.000000 |
| skewness_about | 846.000000 | 6.361702 | 4.903244 | 0.000000 | 2.000000 | 6.000000 | 9.000000 | 22.000000 |
| skewness_about.1 | 846.000000 | 12.600473 | 8.930962 | 0.000000 | 5.000000 | 11.000000 | 19.000000 | 41.000000 |
| skewness_about.2 | 846.000000 | 188.918440 | 6.152247 | 176.000000 | 184.000000 | 188.000000 | 193.000000 | 206.000000 |
| hollows_ratio | 846.000000 | 195.632388 | 7.438797 | 181.000000 | 190.250000 | 197.000000 | 201.000000 | 211.000000 |
def boxhistplot(column, data):
    # Interactive histogram and box plot of one feature, coloured by vehicle class
    fig = px.histogram(data, x = data[column], color = 'class')
    fig.show()
    fig2 = px.box(data, x = data[column], color = 'class')
    fig2.show()
col = ['compactness','circularity','distance_circularity', 'radius_ratio','pr.axis_aspect_ratio','max.length_aspect_ratio','scatter_ratio','elongatedness','pr.axis_rectangularity','max.length_rectangularity',
'scaled_variance', 'scaled_variance.1','scaled_radius_of_gyration','scaled_radius_of_gyration.1', 'skewness_about' , 'skewness_about.1', 'skewness_about.2','hollows_ratio']
for column in col:
boxhistplot(column, gm)
# Replace IQR outliers in each numeric column with that column's median
for col_name in gm.drop(columns = 'class').columns:
    q1 = gm[col_name].quantile(0.25)
    q3 = gm[col_name].quantile(0.75)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    gm.loc[(gm[col_name] < low) | (gm[col_name] > high), col_name] = gm[col_name].median()
sns.pairplot(gm, diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x1817113e640>
plt.figure(figsize=(20,18))
sns.heatmap(gm.corr(), annot=True, cmap="flag")
<AxesSubplot:>
Dimensionality reduction and Classifier
df = h2o.import_file("C:\\Users\\Admin\\Desktop\\UL_Project\\Part3 - vehicle.csv")
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
import numpy as np
df.head()
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95 | 48 | 83 | 178 | 72 | 10 | 162 | 42 | 20 | 159 | 176 | 379 | 184 | 70 | 6 | 16 | 187 | 197 | van |
| 91 | 41 | 84 | 141 | 57 | 9 | 149 | 45 | 19 | 143 | 170 | 330 | 158 | 72 | 9 | 14 | 189 | 199 | van |
| 104 | 50 | 106 | 209 | 66 | 10 | 207 | 32 | 23 | 158 | 223 | 635 | 220 | 73 | 14 | 9 | 188 | 196 | car |
| 93 | 41 | 82 | 159 | 63 | 9 | 144 | 46 | 19 | 143 | 160 | 309 | 127 | 63 | 6 | 10 | 199 | 207 | van |
| 85 | 44 | 70 | 205 | 103 | 52 | 149 | 45 | 19 | 144 | 241 | 325 | 188 | 127 | 9 | 11 | 180 | 183 | bus |
| 107 | nan | 106 | 172 | 50 | 6 | 255 | 26 | 28 | 169 | 280 | 957 | 264 | 85 | 5 | 9 | 181 | 183 | bus |
| 97 | 43 | 73 | 173 | 65 | 6 | 153 | 42 | 19 | 143 | 176 | 361 | 172 | 66 | 13 | 1 | 200 | 204 | bus |
| 90 | 43 | 66 | 157 | 65 | 9 | 137 | 48 | 18 | 146 | 162 | 281 | 164 | 67 | 3 | 3 | 193 | 202 | van |
| 86 | 34 | 62 | 140 | 61 | 7 | 122 | 54 | 17 | 127 | 141 | 223 | 112 | 64 | 2 | 14 | 200 | 208 | van |
| 93 | 44 | 98 | nan | 62 | 11 | 183 | 36 | 22 | 146 | 202 | 505 | 152 | 64 | 4 | 14 | 195 | 204 | car |
df.describe()
Rows:846 Cols:19
| | compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| type | int | int | int | int | int | int | int | int | int | int | int | int | int | int | int | int | int | int | enum |
| mins | 73.0 | 33.0 | 40.0 | 104.0 | 47.0 | 2.0 | 112.0 | 26.0 | 17.0 | 118.0 | 130.0 | 184.0 | 109.0 | 59.0 | 0.0 | 0.0 | 176.0 | 181.0 | |
| mean | 93.67848699763604 | 44.82877526753867 | 82.1104513064132 | 168.88809523809502 | 61.67890995260661 | 8.56737588652483 | 168.9017751479289 | 40.93372781065084 | 20.582443653618018 | 147.99881796690298 | 188.63107947805455 | 439.4940758293837 | 174.70971563981058 | 72.44774346793342 | 6.364285714285713 | 12.602366863905331 | 188.91952662721883 | 195.63238770685578 | |
| maxs | 119.0 | 59.0 | 112.0 | 333.0 | 138.0 | 55.0 | 265.0 | 61.0 | 29.0 | 188.0 | 320.0 | 1018.0 | 268.0 | 135.0 | 22.0 | 41.0 | 206.0 | 211.0 | |
| sigma | 8.234474253334252 | 6.152171861813706 | 15.778291805673964 | 33.52019797334269 | 7.891463065900405 | 4.601216661132584 | 33.21484792133684 | 7.816185718829101 | 2.5929330612155854 | 14.515651573835008 | 31.411003596400807 | 176.66690269589344 | 32.584808232647724 | 7.486190275604501 | 4.920649076276184 | 8.936081294039763 | 6.155809475363881 | 7.438797429122352 | |
| zeros | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 77 | 30 | 0 | 0 | |
| missing | 0 | 5 | 4 | 6 | 2 | 0 | 1 | 1 | 3 | 0 | 3 | 2 | 2 | 4 | 6 | 1 | 1 | 0 | 0 |
| 0 | 95.0 | 48.0 | 83.0 | 178.0 | 72.0 | 10.0 | 162.0 | 42.0 | 20.0 | 159.0 | 176.0 | 379.0 | 184.0 | 70.0 | 6.0 | 16.0 | 187.0 | 197.0 | van |
| 1 | 91.0 | 41.0 | 84.0 | 141.0 | 57.0 | 9.0 | 149.0 | 45.0 | 19.0 | 143.0 | 170.0 | 330.0 | 158.0 | 72.0 | 9.0 | 14.0 | 189.0 | 199.0 | van |
| 2 | 104.0 | 50.0 | 106.0 | 209.0 | 66.0 | 10.0 | 207.0 | 32.0 | 23.0 | 158.0 | 223.0 | 635.0 | 220.0 | 73.0 | 14.0 | 9.0 | 188.0 | 196.0 | car |
| 3 | 93.0 | 41.0 | 82.0 | 159.0 | 63.0 | 9.0 | 144.0 | 46.0 | 19.0 | 143.0 | 160.0 | 309.0 | 127.0 | 63.0 | 6.0 | 10.0 | 199.0 | 207.0 | van |
| 4 | 85.0 | 44.0 | 70.0 | 205.0 | 103.0 | 52.0 | 149.0 | 45.0 | 19.0 | 144.0 | 241.0 | 325.0 | 188.0 | 127.0 | 9.0 | 11.0 | 180.0 | 183.0 | bus |
| 5 | 107.0 | nan | 106.0 | 172.0 | 50.0 | 6.0 | 255.0 | 26.0 | 28.0 | 169.0 | 280.0 | 957.0 | 264.0 | 85.0 | 5.0 | 9.0 | 181.0 | 183.0 | bus |
| 6 | 97.0 | 43.0 | 73.0 | 173.0 | 65.0 | 6.0 | 153.0 | 42.0 | 19.0 | 143.0 | 176.0 | 361.0 | 172.0 | 66.0 | 13.0 | 1.0 | 200.0 | 204.0 | bus |
| 7 | 90.0 | 43.0 | 66.0 | 157.0 | 65.0 | 9.0 | 137.0 | 48.0 | 18.0 | 146.0 | 162.0 | 281.0 | 164.0 | 67.0 | 3.0 | 3.0 | 193.0 | 202.0 | van |
| 8 | 86.0 | 34.0 | 62.0 | 140.0 | 61.0 | 7.0 | 122.0 | 54.0 | 17.0 | 127.0 | 141.0 | 223.0 | 112.0 | 64.0 | 2.0 | 14.0 | 200.0 | 208.0 | van |
| 9 | 93.0 | 44.0 | 98.0 | nan | 62.0 | 11.0 | 183.0 | 36.0 | 22.0 | 146.0 | 202.0 | 505.0 | 152.0 | 64.0 | 4.0 | 14.0 | 195.0 | 204.0 | car |
df.columns
['compactness', 'circularity', 'distance_circularity', 'radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio', 'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1', 'skewness_about.2', 'hollows_ratio', 'class']
df.isna()
| isNA(compactness) | isNA(circularity) | isNA(distance_circularity) | isNA(radius_ratio) | isNA(pr.axis_aspect_ratio) | isNA(max.length_aspect_ratio) | isNA(scatter_ratio) | isNA(elongatedness) | isNA(pr.axis_rectangularity) | isNA(max.length_rectangularity) | isNA(scaled_variance) | isNA(scaled_variance.1) | isNA(scaled_radius_of_gyration) | isNA(scaled_radius_of_gyration.1) | isNA(skewness_about) | isNA(skewness_about.1) | isNA(skewness_about.2) | isNA(hollows_ratio) | isNA(class) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
df[ df["circularity"].isna(), "circularity"] = 0
df[ df["distance_circularity"].isna(), "distance_circularity"] = 0
df[ df["radius_ratio"].isna(), "radius_ratio"] = 0
df[ df["pr.axis_aspect_ratio"].isna(), "pr.axis_aspect_ratio"] = 0
df[ df["scatter_ratio"].isna(), "scatter_ratio"] = 0
df[ df["elongatedness"].isna(), "elongatedness"] = 0
df[ df["pr.axis_rectangularity"].isna(), "pr.axis_rectangularity"] = 0
df[ df["scaled_variance"].isna(), "scaled_variance"] = 0
df[ df["scaled_variance.1"].isna(), "scaled_variance.1"] = 0
df[ df["scaled_radius_of_gyration"].isna(), "scaled_radius_of_gyration"] = 0
df[ df["scaled_radius_of_gyration.1"].isna(), "scaled_radius_of_gyration.1"] = 0
df[ df["skewness_about"].isna(), "skewness_about"] = 0
df[ df["skewness_about.1"].isna(), "skewness_about.1"] = 0
df[ df["skewness_about.2"].isna(), "skewness_about.2"] = 0
df.isna()
| isNA(compactness) | isNA(circularity) | isNA(distance_circularity) | isNA(radius_ratio) | isNA(pr.axis_aspect_ratio) | isNA(max.length_aspect_ratio) | isNA(scatter_ratio) | isNA(elongatedness) | isNA(pr.axis_rectangularity) | isNA(max.length_rectangularity) | isNA(scaled_variance) | isNA(scaled_variance.1) | isNA(scaled_radius_of_gyration) | isNA(scaled_radius_of_gyration.1) | isNA(skewness_about) | isNA(skewness_about.1) | isNA(skewness_about.2) | isNA(hollows_ratio) | isNA(class) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
train, valid = df.split_frame(ratios = [.8], seed = 1234)
gm_pca = H2OPrincipalComponentAnalysisEstimator(k = 6,
use_all_factor_levels = True,
pca_method = "glrm",
transform = "standardize",
impute_missing = True)
gm_pca.train(training_frame = train)
pca Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OPrincipalComponentAnalysisEstimator : Principal Components Analysis
Model Key: PCA_model_python_1637400833869_8
Importance of components:
| | | pc1 | pc2 | pc3 | pc4 | pc5 | pc6 |
|---|---|---|---|---|---|---|---|
| 0 | Standard deviation | 3.301050 | 2.141384 | 1.387178 | 0.800529 | 0.760328 | 0.613035 |
| 1 | Proportion of Variance | 0.573478 | 0.241325 | 0.101269 | 0.033726 | 0.030424 | 0.019778 |
| 2 | Cumulative Proportion | 0.573478 | 0.814803 | 0.916072 | 0.949798 | 0.980222 | 1.000000 |
ModelMetricsPCA: pca
** Reported on train data. **
MSE: NaN
RMSE: NaN
Scoring history from GLRM:
| | timestamp | duration | iterations | step_size | objective |
|---|---|---|---|---|---|
| 0 | 2021-11-20 15:37:02 | 0.293 sec | 0.0 | 0.666667 | 23864.710001 |
| 1 | 2021-11-20 15:37:02 | 0.326 sec | 1.0 | 0.444444 | 23864.710001 |
| 2 | 2021-11-20 15:37:02 | 0.332 sec | 2.0 | 0.222222 | 23864.710001 |
| 3 | 2021-11-20 15:37:02 | 0.363 sec | 3.0 | 0.074074 | 23864.710001 |
| 4 | 2021-11-20 15:37:02 | 0.379 sec | 4.0 | 0.077778 | 13230.498666 |
| 5 | 2021-11-20 15:37:02 | 0.379 sec | 5.0 | 0.081667 | 6517.755051 |
| 6 | 2021-11-20 15:37:02 | 0.395 sec | 6.0 | 0.085750 | 5434.790193 |
| 7 | 2021-11-20 15:37:02 | 0.395 sec | 7.0 | 0.090038 | 5013.298733 |
| 8 | 2021-11-20 15:37:02 | 0.410 sec | 8.0 | 0.094539 | 4621.712235 |
| 9 | 2021-11-20 15:37:02 | 0.410 sec | 9.0 | 0.099266 | 4332.412708 |
| 10 | 2021-11-20 15:37:02 | 0.426 sec | 10.0 | 0.104230 | 4183.165338 |
| 11 | 2021-11-20 15:37:02 | 0.426 sec | 11.0 | 0.109441 | 4105.569531 |
| 12 | 2021-11-20 15:37:02 | 0.426 sec | 12.0 | 0.114913 | 4031.815344 |
| 13 | 2021-11-20 15:37:02 | 0.441 sec | 13.0 | 0.120659 | 3934.514827 |
| 14 | 2021-11-20 15:37:02 | 0.441 sec | 14.0 | 0.126692 | 3803.628521 |
| 15 | 2021-11-20 15:37:02 | 0.455 sec | 15.0 | 0.133026 | 3646.320972 |
| 16 | 2021-11-20 15:37:02 | 0.455 sec | 16.0 | 0.139678 | 3495.534944 |
| 17 | 2021-11-20 15:37:02 | 0.455 sec | 17.0 | 0.146662 | 3405.594955 |
| 18 | 2021-11-20 15:37:02 | 0.471 sec | 18.0 | 0.097774 | 3405.594955 |
| 19 | 2021-11-20 15:37:02 | 0.471 sec | 19.0 | 0.102663 | 3323.893952 |
See the whole table with table.as_data_frame()
pred = gm_pca.predict(valid)
pred.head()
pca prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|
| 0.261711 | -1.69288 | -0.591581 | 2.3436 | -0.480915 | 0.514556 |
| -0.738633 | 0.0100112 | -0.18286 | 1.11567 | 1.32458 | -0.155437 |
| 0.596814 | 1.01228 | 1.02359 | 0.41501 | -0.605221 | 1.50541 |
| 0.595886 | 0.79644 | 0.197733 | 1.47802 | -0.739702 | 0.464038 |
| -1.02738 | 2.12535 | 3.36802 | -0.898483 | 0.660193 | 0.0493423 |
| -1.05419 | 1.94706 | 2.63529 | -1.13959 | 4.75732 | 0.728513 |
| 1.01103 | 0.146326 | -1.2997 | -0.428888 | -0.0477065 | 0.211118 |
| 0.961258 | -2.68362 | -3.73144 | -1.25351 | 0.552719 | -0.571587 |
| -0.0154917 | -1.64509 | -4.06778 | -1.20246 | 0.624953 | -0.791027 |
| -0.30561 | -0.922264 | -2.80582 | -1.04617 | 0.832248 | -1.80275 |
h2o.download_csv(pred,"C:\\Users\\Admin\\Desktop\\UL_Project")
'C:\\Users\\Admin\\Desktop\\UL_Project\\unknown'
pca_df = pd.read_csv('unknown.csv')
print(pca_df.shape)
pca_df.head()
(169, 6)
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|---|
| 0 | -0.550515 | -1.455035 | -0.620586 | 0.904911 | 2.329491 | 0.432266 |
| 1 | 0.672855 | -0.140634 | 0.251183 | 1.682191 | 0.280005 | 0.229848 |
| 2 | -0.537566 | 1.134284 | 0.883038 | -0.495166 | 0.850183 | 1.224060 |
| 3 | -0.541855 | 0.863570 | 0.322688 | 0.231745 | 1.611843 | -0.171900 |
| 4 | 0.878768 | 1.551943 | 3.517661 | -0.669590 | -1.139809 | 0.425993 |
pca_df.columns
Index(['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6'], dtype='object')
pca_df.dtypes
PC1 float64 PC2 float64 PC3 float64 PC4 float64 PC5 float64 PC6 float64 dtype: object
sns.pairplot(pca_df, diag_kind = 'kde');
pred.summary()
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|---|
| type | real | real | real | real | real | real |
| mins | -1.49035277978515 | -3.0920713463503735 | -5.399418124579308 | -2.7054723652368233 | -1.3717406947379076 | -2.3102232737872446 |
| mean | -0.2663136922419451 | 0.09617222091974326 | 0.16161760853023388 | 0.4684380039753266 | 0.23011668527354476 | -0.0668079797009858 |
| maxs | 1.468860362329572 | 3.4518629579161084 | 4.550456780703172 | 3.837174103143492 | 4.757315575856198 | 2.5453756212079823 |
| sigma | 0.7875834015148941 | 1.5380830712699036 | 2.4292523425515915 | 1.3376988152353748 | 0.8068389016891415 | 0.8601966684630572 |
| zeros | 0 | 0 | 0 | 0 | 0 | 0 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0.261711319263233 | -1.6928822142736484 | -0.5915806113495691 | 2.3436005984809665 | -0.48091475538188555 | 0.5145560658086726 |
| 1 | -0.7386334263842358 | 0.010011161913011914 | -0.1828598135587042 | 1.1156737708088627 | 1.324575798913365 | -0.15543722552337652 |
| 2 | 0.5968141393892453 | 1.0122840864021327 | 1.0235887103548709 | 0.41500977220319457 | -0.6052210106464788 | 1.5054096811611142 |
| 3 | 0.5958862654262771 | 0.7964396593221901 | 0.19773281383976354 | 1.4780174909839174 | -0.7397023911011823 | 0.4640379898579617 |
| 4 | -1.0273817059628179 | 2.125348417784135 | 3.3680227532767324 | -0.8984827417854093 | 0.6601934964572601 | 0.0493423319518148 |
| 5 | -1.0541884349084514 | 1.9470636025193206 | 2.635289837370166 | -1.1395945414093125 | 4.757315575856198 | 0.7285134312055264 |
| 6 | 1.011028467067154 | 0.14632640803372077 | -1.2997009972977842 | -0.4288879515229652 | -0.04770645029778173 | 0.21111757344907353 |
| 7 | 0.9612576898504896 | -2.6836185824843266 | -3.7314392110386714 | -1.2535125164298881 | 0.5527187280024535 | -0.5715866407590426 |
| 8 | -0.015491689870722941 | -1.6450937545359232 | -4.067776090193787 | -1.2024585683802176 | 0.624952548446712 | -0.7910272411540449 |
| 9 | -0.30561009354076696 | -0.9222642756323093 | -2.8058172541330917 | -1.0461725474975445 | 0.8322481035397795 | -1.8027489317104204 |
PCA = pd.concat([gm.reset_index(drop=True), pca_df.reset_index(drop=True)], axis=1)
PCA.head()
| | compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | ... | skewness_about.1 | skewness_about.2 | hollows_ratio | class | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95 | 48.0 | 83.0 | 178.0 | 72.0 | 10 | 162.0 | 42.0 | 20.0 | 159 | ... | 16.0 | 187.0 | 197 | van | -0.550515 | -1.455035 | -0.620586 | 0.904911 | 2.329491 | 0.432266 |
| 1 | 91 | 41.0 | 84.0 | 141.0 | 57.0 | 9 | 149.0 | 45.0 | 19.0 | 143 | ... | 14.0 | 189.0 | 199 | van | 0.672855 | -0.140634 | 0.251183 | 1.682191 | 0.280005 | 0.229848 |
| 2 | 104 | 50.0 | 106.0 | 209.0 | 66.0 | 10 | 207.0 | 32.0 | 23.0 | 158 | ... | 9.0 | 188.0 | 196 | car | -0.537566 | 1.134284 | 0.883038 | -0.495166 | 0.850183 | 1.224060 |
| 3 | 93 | 41.0 | 82.0 | 159.0 | 63.0 | 9 | 144.0 | 46.0 | 19.0 | 143 | ... | 10.0 | 199.0 | 207 | van | -0.541855 | 0.863570 | 0.322688 | 0.231745 | 1.611843 | -0.171900 |
| 4 | 85 | 44.0 | 70.0 | 205.0 | 61.0 | 8 | 149.0 | 45.0 | 19.0 | 144 | ... | 11.0 | 180.0 | 183 | bus | 0.878768 | 1.551943 | 3.517661 | -0.669590 | -1.139809 | 0.425993 |
5 rows × 25 columns
for cols in PCA.columns:
if(cols != 'class'):
PCA[cols] = PCA[cols].fillna(PCA[cols].median())
print(PCA.isnull().sum())
compactness 0
circularity 0
distance_circularity 0
radius_ratio 0
pr.axis_aspect_ratio 0
max.length_aspect_ratio 0
scatter_ratio 0
elongatedness 0
pr.axis_rectangularity 0
max.length_rectangularity 0
scaled_variance 0
scaled_variance.1 0
scaled_radius_of_gyration 0
scaled_radius_of_gyration.1 0
skewness_about 0
skewness_about.1 0
skewness_about.2 0
hollows_ratio 0
class 0
PC1 0
PC2 0
PC3 0
PC4 0
PC5 0
PC6 0
dtype: int64
from luciferml.supervised.classification import Classification
accuracy_scores = {}
feature = ['compactness','circularity','distance_circularity','radius_ratio','pr.axis_aspect_ratio', 'max.length_aspect_ratio',
'scatter_ratio','elongatedness','pr.axis_rectangularity', 'max.length_rectangularity','scaled_variance','scaled_variance.1',
'scaled_radius_of_gyration','scaled_radius_of_gyration.1','skewness_about','skewness_about.1','skewness_about.2','hollows_ratio']
X = gm[feature]
y = gm['class']
classifier = Classification(predictor = 'svm')
classifier.fit(X, y)
result = classifier.result()
Started LuciferML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are Categorical [*]
Encoding Labels
Encoding Labels Done [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training Support Vector Machine on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Making Confusion Matrix [*]
[[50 1 1]
[ 1 75 2]
[ 0 0 40]]
Confusion Matrix Done [ ✓ ]
Evaluating Model Performance [*]
Validation Accuracy is : 0.9705882352941176
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
Accuracy: 95.26 %
Standard Deviation: 2.46 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.7289822101593018 seconds
feature = ['distance_circularity','radius_ratio','pr.axis_aspect_ratio', 'max.length_aspect_ratio',
'scatter_ratio','pr.axis_rectangularity', 'max.length_rectangularity','scaled_variance.1','scaled_radius_of_gyration.1','skewness_about.1','skewness_about.2','hollows_ratio','PC2','PC3','PC5','PC6']
X = PCA[feature]
y = PCA['class']
classifier = Classification(predictor = 'svm')
classifier.fit(X,y)
result = classifier.result()
Started LuciferML
Checking if labels or features are categorical! [*]
Features are not categorical [ ✓ ]
Labels are Categorical [*]
Encoding Labels
Encoding Labels Done [ ✓ ]
Checking for Categorical Variables Done [ ✓ ]
Checking for Sparse Matrix [*]
Splitting Data into Train and Validation Sets [*]
Splitting Done [ ✓ ]
Scaling Training and Test Sets [*]
Scaling Done [ ✓ ]
Training Support Vector Machine on Training Set [*]
Model Training Done [ ✓ ]
Predicting Data [*]
Data Prediction Done [ ✓ ]
Making Confusion Matrix [*]
[[44 7 1]
[ 0 75 3]
[ 1 1 38]]
Confusion Matrix Done [ ✓ ]
Evaluating Model Performance [*]
Validation Accuracy is : 0.9235294117647059
Evaluating Model Performance [ ✓ ]
Applying K-Fold Cross Validation [*]
Accuracy: 90.97 %
Standard Deviation: 2.93 %
K-Fold Cross Validation [ ✓ ]
Complete [ ✓ ]
Time Elapsed : 0.4462108612060547 seconds
Both models achieved above 90% accuracy. The SVM trained on the PCA-based feature subset scores only slightly lower, even though large chunks of the original feature information were excluded.
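The choice of k = 6 components can also be sanity-checked outside H2O. A minimal sketch with scikit-learn, assuming gm is the cleaned pandas frame from the EDA step (the sklearn PCA class is aliased to avoid clashing with the PCA dataframe above):
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA as SkPCA

X_num = gm.drop(columns = 'class')
X_std = StandardScaler().fit_transform(X_num)     # standardize, as in the H2O run
sk_pca = SkPCA(n_components = 6).fit(X_std)
print(sk_pca.explained_variance_ratio_.cumsum())  # cumulative proportion of variance per component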
EDA and visualisation
ipl = pd.read_csv('Part4 - batting_bowling_ipl_bat.csv')
ipl2 = ipl.dropna()
print(ipl2.to_string())
Name Runs Ave SR Fours Sixes HF
1 CH Gayle 733.0 61.08 160.74 46.0 59.0 9.0
3 G Gambhir 590.0 36.87 143.55 64.0 17.0 6.0
5 V Sehwag 495.0 33.00 161.23 57.0 19.0 5.0
7 CL White 479.0 43.54 149.68 41.0 20.0 5.0
9 S Dhawan 569.0 40.64 129.61 58.0 18.0 5.0
11 AM Rahane 560.0 40.00 129.33 73.0 10.0 5.0
13 KP Pietersen 305.0 61.00 147.34 22.0 20.0 3.0
15 RG Sharma 433.0 30.92 126.60 39.0 18.0 5.0
17 AB de Villiers 319.0 39.87 161.11 26.0 15.0 3.0
19 JP Duminy 244.0 81.33 128.42 13.0 11.0 2.0
21 DA Warner 256.0 36.57 164.10 28.0 14.0 3.0
23 SR Watson 255.0 42.50 151.78 26.0 14.0 2.0
25 F du Plessis 398.0 33.16 130.92 29.0 17.0 3.0
27 OA Shah 340.0 37.77 132.81 24.0 16.0 3.0
29 DJ Bravo 371.0 46.37 140.53 20.0 20.0 0.0
31 DJ Hussey 396.0 33.00 129.83 28.0 17.0 2.0
33 SK Raina 441.0 25.94 135.69 36.0 19.0 1.0
35 AT Rayudu 333.0 37.00 132.14 21.0 14.0 2.0
37 Mandeep Singh 432.0 27.00 126.31 53.0 7.0 2.0
39 R Dravid 462.0 28.87 112.13 63.0 4.0 2.0
41 DR Smith 157.0 39.25 160.20 18.0 7.0 1.0
43 M Vijay 336.0 25.84 125.84 39.0 10.0 2.0
45 SPD Smith 362.0 40.22 135.58 24.0 14.0 0.0
47 TM Dilshan 285.0 35.62 109.19 33.0 5.0 3.0
49 RV Uthappa 405.0 27.00 118.07 38.0 10.0 2.0
51 SE Marsh 336.0 30.54 120.00 39.0 7.0 2.0
53 KA Pollard 220.0 24.44 138.36 15.0 14.0 2.0
55 DMD Jayawardene 335.0 27.91 112.41 39.0 3.0 3.0
57 V Kohli 364.0 28.00 111.65 33.0 9.0 2.0
59 MA Agarwal 225.0 20.45 142.40 19.0 15.0 1.0
61 SR Tendulkar 324.0 29.45 114.48 39.0 4.0 2.0
63 MEK Hussey 261.0 32.62 110.59 28.0 8.0 2.0
65 JH Kallis 409.0 25.56 106.51 34.0 10.0 2.0
67 MS Dhoni 357.0 29.75 128.41 26.0 9.0 1.0
69 MS Bisla 213.0 30.42 133.12 16.0 10.0 1.0
71 JD Ryder 256.0 25.60 120.75 23.0 8.0 2.0
73 BJ Hodge 245.0 30.62 140.00 18.0 9.0 0.0
75 NV Ojha 255.0 23.18 113.83 21.0 13.0 1.0
77 DB Das 126.0 42.00 135.48 9.0 6.0 0.0
79 AC Gilchrist 172.0 34.40 120.27 21.0 4.0 1.0
81 BB McCullum 289.0 24.08 102.12 37.0 3.0 1.0
83 IK Pathan 176.0 25.14 139.68 14.0 6.0 0.0
85 Azhar Mahmood 186.0 23.25 130.98 16.0 8.0 0.0
87 MK Pandey 143.0 20.42 127.67 12.0 6.0 1.0
89 S Badrinath 196.0 28.00 108.28 23.0 2.0 1.0
91 DA Miller 98.0 32.66 130.66 6.0 4.0 0.0
93 MK Tiwary 260.0 26.00 105.69 21.0 3.0 1.0
95 JA Morkel 107.0 15.28 157.35 5.0 6.0 0.0
97 LRPL Taylor 197.0 19.70 115.20 12.0 7.0 1.0
99 M Manhas 120.0 30.00 125.00 10.0 4.0 0.0
101 DT Christian 145.0 29.00 122.88 8.0 6.0 0.0
103 RA Jadeja 191.0 15.91 126.49 13.0 9.0 0.0
105 JEC Franklin 220.0 24.44 98.65 15.0 6.0 1.0
107 KC Sangakkara 200.0 18.18 108.69 21.0 4.0 1.0
109 Y Nagar 153.0 30.60 115.03 13.0 3.0 0.0
111 STR Binny 90.0 22.50 134.32 9.0 3.0 0.0
113 SS Tiwary 191.0 23.87 112.35 9.0 8.0 0.0
115 KD Karthik 238.0 18.30 111.73 30.0 2.0 0.0
117 AL Menaria 220.0 20.00 108.91 14.0 8.0 0.0
119 PA Patel 194.0 17.63 117.57 19.0 4.0 0.0
121 SC Ganguly 268.0 17.86 98.89 30.0 4.0 0.0
123 YK Pathan 194.0 19.40 114.79 10.0 7.0 0.0
125 Harbhajan Singh 108.0 12.00 135.00 14.0 3.0 0.0
127 RE Levi 83.0 13.83 113.69 10.0 4.0 1.0
129 LR Shukla 75.0 12.50 131.57 4.0 5.0 0.0
131 Y Venugopal Rao 132.0 22.00 104.76 8.0 5.0 0.0
133 AD Mathews 127.0 18.14 117.59 5.0 4.0 0.0
135 PP Chawla 106.0 13.25 120.45 9.0 4.0 0.0
137 Shakib Al Hasan 91.0 15.16 122.97 6.0 3.0 0.0
139 N Saini 140.0 14.00 99.29 16.0 0.0 1.0
141 MN Samuels 124.0 17.71 100.81 7.0 5.0 0.0
143 MJ Clarke 98.0 16.33 104.25 12.0 0.0 0.0
145 R Bhatia 35.0 11.66 125.00 4.0 0.0 0.0
147 R Vinay Kumar 68.0 13.60 109.67 3.0 2.0 0.0
149 P Kumar 35.0 11.66 116.66 2.0 1.0 0.0
151 J Botha 58.0 14.50 107.40 4.0 1.0 0.0
153 A Ashish Reddy 35.0 8.75 120.68 3.0 1.0 0.0
155 DL Vettori 31.0 7.75 119.23 3.0 1.0 0.0
157 SP Goswami 69.0 13.80 102.98 4.0 1.0 0.0
159 SL Malinga 55.0 9.16 103.77 4.0 3.0 0.0
161 RJ Peterson 32.0 10.66 106.66 3.0 1.0 0.0
163 R Ashwin 18.0 6.00 120.00 2.0 0.0 0.0
165 B Kumar 40.0 13.33 100.00 4.0 0.0 0.0
167 DW Steyn 19.0 3.80 90.47 0.0 1.0 0.0
169 A Mishra 16.0 5.33 80.00 1.0 0.0 0.0
171 Z Khan 12.0 6.00 70.58 1.0 0.0 0.0
173 WD Parnell 19.0 4.75 70.37 2.0 0.0 0.0
175 PC Valthaty 30.0 5.00 58.82 4.0 0.0 0.0
177 RP Singh 6.0 3.00 50.00 0.0 0.0 0.0
179 R Sharma 2.0 0.50 18.18 0.0 0.0 0.0
ipl2.columns
Index(['Name', 'Runs', 'Ave', 'SR', 'Fours', 'Sixes', 'HF'], dtype='object')
ipl2.describe().T.style.bar(
subset=['mean'],
color='Reds').background_gradient(
subset=['std'], cmap='ocean').background_gradient(subset=['50%'], cmap='PuBu')
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Runs | 90.000000 | 219.933333 | 156.253669 | 2.000000 | 98.000000 | 196.500000 | 330.750000 | 733.000000 |
| Ave | 90.000000 | 24.729889 | 13.619215 | 0.500000 | 14.665000 | 24.440000 | 32.195000 | 81.330000 |
| SR | 90.000000 | 119.164111 | 23.656547 | 18.180000 | 108.745000 | 120.135000 | 131.997500 | 164.100000 |
| Fours | 90.000000 | 19.788889 | 16.399845 | 0.000000 | 6.250000 | 16.000000 | 28.000000 | 73.000000 |
| Sixes | 90.000000 | 7.577778 | 8.001373 | 0.000000 | 3.000000 | 6.000000 | 10.000000 | 59.000000 |
| HF | 90.000000 | 1.188889 | 1.688656 | 0.000000 | 0.000000 | 0.500000 | 2.000000 | 9.000000 |
def boxhistplot(column, data):
    # Interactive histogram of one feature, coloured by player name
    fig = px.histogram(data, x = data[column], color = 'Name')
    fig.show()

col = ['Runs', 'Ave', 'SR', 'Fours', 'Sixes', 'HF']
for column in col:
    boxhistplot(column, ipl2)
# Replace IQR outliers in each numeric column with that column's median
for col_name in ipl2.drop(columns = 'Name').columns:
    q1 = ipl2[col_name].quantile(0.25)
    q3 = ipl2[col_name].quantile(0.75)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    ipl2.loc[(ipl2[col_name] < low) | (ipl2[col_name] > high), col_name] = ipl2[col_name].median()
sns.pairplot(ipl2, diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x18103ea1340>
plt.figure(figsize=(20,18))
sns.heatmap(ipl2.corr(), annot=True, cmap="flag")
<AxesSubplot:>
Data Driven Model
ipl3 = ipl2.iloc[:,1:7]
ipl3.describe()
| | Runs | Ave | SR | Fours | Sixes | HF |
|---|---|---|---|---|---|---|
| count | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 |
| mean | 213.972222 | 23.284444 | 122.861056 | 18.100000 | 6.988889 | 1.033333 |
| std | 146.382107 | 10.851282 | 16.832871 | 13.849147 | 5.829978 | 1.375549 |
| min | 2.000000 | 0.500000 | 80.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 98.000000 | 14.665000 | 111.670000 | 6.250000 | 3.000000 | 0.000000 |
| 50% | 196.250000 | 24.440000 | 120.202500 | 16.000000 | 6.000000 | 0.250000 |
| 75% | 322.750000 | 30.585000 | 131.997500 | 26.000000 | 10.000000 | 2.000000 |
| max | 590.000000 | 46.370000 | 164.100000 | 58.000000 | 20.000000 | 5.000000 |
import h2o
from h2o.estimators import H2OKMeansEstimator
h2o.init(strict_version_check=False, url="http://192.168.59.147:54321")
Checking whether there is an H2O instance running at http://192.168.59.147:54321 ..... not found.
Attempting to start a local H2O server...
Java HotSpot(TM) 64-Bit Server VM (build 25.311-b11, mixed mode)
Starting server from C:\Users\Admin\anaconda3\Lib\site-packages\h2o\backend\bin\h2o.jar
Ice root: C:\Users\Admin\AppData\Local\Temp\tmp8sra_7mr
JVM stdout: C:\Users\Admin\AppData\Local\Temp\tmp8sra_7mr\h2o_Admin_started_from_python.out
JVM stderr: C:\Users\Admin\AppData\Local\Temp\tmp8sra_7mr\h2o_Admin_started_from_python.err
Server is running at http://127.0.0.1:54323
Connecting to H2O server at http://127.0.0.1:54323 ... successful.
| H2O_cluster_uptime: | 04 secs |
| H2O_cluster_timezone: | Asia/Kolkata |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.34.0.3 |
| H2O_cluster_version_age: | 1 month and 13 days |
| H2O_cluster_name: | H2O_from_python_Admin_7zm4v6 |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 1.689 Gb |
| H2O_cluster_total_cores: | 8 |
| H2O_cluster_allowed_cores: | 8 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://127.0.0.1:54323 |
| H2O_connection_proxy: | {"http": null, "https": null} |
| H2O_internal_security: | False |
| H2O_API_Extensions: | Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 |
| Python_version: | 3.8.8 final |
from timeit import default_timer as timer
from datetime import timedelta
import time
start = timer()
end = timer()
print("Time:", timedelta(seconds=end-start))
Time: 0:00:00.000055
dataset_h2o = h2o.H2OFrame(ipl2)
h2o_km = H2OKMeansEstimator(k=2, init="furthest", standardize=True)
start = timer()
h2o_km.train(training_frame=dataset_h2o)
end = timer()
user_points = h2o.H2OFrame(h2o_km.centers())
h2o_km.show()
time_km = timedelta(seconds=end-start)
print("Time:", time_km)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
kmeans Model Build progress: |███████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OKMeansEstimator : K-means
Model Key: KMeans_model_python_1637403440366_1
Model Summary:
| | number_of_rows | number_of_clusters | number_of_categorical_columns | number_of_iterations | within_cluster_sum_of_squares | total_sum_of_squares | between_cluster_sum_of_squares |
|---|---|---|---|---|---|---|---|
| 0 | 90.0 | 2.0 | 0.0 | 3.0 | 274.791997 | 534.0 | 259.208003 |
ModelMetricsClustering: kmeans
** Reported on train data. **
MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 274.79199757458264
Total Sum of Square Error to Grand Mean: 533.999999018899
Between Cluster Sum of Square Error: 259.2080014443163
Centroid Statistics:
| | centroid | size | within_cluster_sum_of_squares |
|---|---|---|---|
| 0 | 1.0 | 38.0 | 161.022319 |
| 1 | 2.0 | 52.0 | 113.769679 |
Scoring History:
| | timestamp | duration | iterations | number_of_reassigned_observations | within_cluster_sum_of_squares |
|---|---|---|---|---|---|
| 0 | 2021-11-20 15:47:42 | 0.056 sec | 0.0 | NaN | NaN |
| 1 | 2021-11-20 15:47:42 | 0.149 sec | 1.0 | 90.0 | 527.943181 |
| 2 | 2021-11-20 15:47:42 | 0.181 sec | 2.0 | 1.0 | 275.111288 |
| 3 | 2021-11-20 15:47:42 | 0.181 sec | 3.0 | 0.0 | 274.791997 |
Time: 0:00:00.443291
data_h2o_dataset = h2o.H2OFrame(ipl2)
h2o_km_dataset = H2OKMeansEstimator(k=2, init="furthest", standardize=True)
start = timer()
h2o_km_dataset.train(x=["Runs", "Ave", "SR", "Fours", "Sixes", "HF"], training_frame=data_h2o_dataset)
end = timer()
user_points = h2o.H2OFrame(h2o_km_dataset.centers())
h2o_km_dataset.show()
print("Time:", timedelta(seconds=end-start))
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
kmeans Model Build progress: |███████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OKMeansEstimator : K-means
Model Key: KMeans_model_python_1637403440366_2
Model Summary:
| | number_of_rows | number_of_clusters | number_of_categorical_columns | number_of_iterations | within_cluster_sum_of_squares | total_sum_of_squares | between_cluster_sum_of_squares |
|---|---|---|---|---|---|---|---|
| 0 | 90.0 | 2.0 | 0.0 | 5.0 | 274.791997 | 534.0 | 259.208003 |
ModelMetricsClustering: kmeans
** Reported on train data. **
MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 274.79199757458264
Total Sum of Square Error to Grand Mean: 533.999999018899
Between Cluster Sum of Square Error: 259.2080014443163
Centroid Statistics:
| | centroid | size | within_cluster_sum_of_squares |
|---|---|---|---|
| 0 | 1.0 | 38.0 | 161.022319 |
| 1 | 2.0 | 52.0 | 113.769679 |
Scoring History:
| | timestamp | duration | iterations | number_of_reassigned_observations | within_cluster_sum_of_squares |
|---|---|---|---|---|---|
| 0 | 2021-11-20 15:48:01 | 0.005 sec | 0.0 | NaN | NaN |
| 1 | 2021-11-20 15:48:01 | 0.024 sec | 1.0 | 90.0 | 638.714608 |
| 2 | 2021-11-20 15:48:01 | 0.026 sec | 2.0 | 8.0 | 297.874877 |
| 3 | 2021-11-20 15:48:01 | 0.027 sec | 3.0 | 5.0 | 280.975640 |
| 4 | 2021-11-20 15:48:01 | 0.028 sec | 4.0 | 2.0 | 275.437811 |
| 5 | 2021-11-20 15:48:01 | 0.029 sec | 5.0 | 0.0 | 274.791997 |
Time: 0:00:00.273586
from h2o.estimators.aggregator import H2OAggregatorEstimator
params = { "target_num_exemplars": 200,
"rel_tol_num_exemplars": 0.5,
"categorical_encoding": "eigen"}
agg = H2OAggregatorEstimator(**params)
start = timer()
agg.train(x=["Name","Runs", "Ave", "SR", "Fours", "Sixes", "HF"], training_frame=data_h2o_dataset)
data_agg_12_dataset = agg.aggregated_frame
h2o_km_co_agg_12_dataset = H2OKMeansEstimator(k=2, user_points=user_points, standardize=True)
h2o_km_co_agg_12_dataset.train(x=["Name","Runs", "Ave", "SR", "Fours", "Sixes", "HF"],training_frame=data_agg_12_dataset)
end = timer()
h2o_km_co_agg_12_dataset.show()
time_h2o_km_co_agg_12_dataset = timedelta(seconds=end-start)
print("Time:", time_h2o_km_co_agg_12_dataset)
aggregator Model Build progress: |███████████████████████████████████████████████| (done) 100%
kmeans Model Build progress: |███████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OKMeansEstimator : K-means
Model Key: KMeans_model_python_1637403440366_4
Model Summary:
| | number_of_rows | number_of_clusters | number_of_categorical_columns | number_of_iterations | within_cluster_sum_of_squares | total_sum_of_squares | between_cluster_sum_of_squares |
|---|---|---|---|---|---|---|---|
| 0 | 90.0 | 2.0 | 0.0 | 2.0 | 274.791997 | 534.0 | 259.208003 |
ModelMetricsClustering: kmeans
** Reported on train data. **
MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 274.79199757458264
Total Sum of Square Error to Grand Mean: 533.999999018899
Between Cluster Sum of Square Error: 259.2080014443163
Centroid Statistics:
| | centroid | size | within_cluster_sum_of_squares |
|---|---|---|---|
| 0 | 1.0 | 38.0 | 161.022319 |
| 1 | 2.0 | 52.0 | 113.769679 |
Scoring History:
| | timestamp | duration | iterations | number_of_reassigned_observations | within_cluster_sum_of_squares |
|---|---|---|---|---|---|
| 0 | 2021-11-20 15:48:18 | 0.000 sec | 0.0 | NaN | NaN |
| 1 | 2021-11-20 15:48:18 | 0.000 sec | 1.0 | 90.0 | 274.791997 |
| 2 | 2021-11-20 15:48:18 | 0.017 sec | 2.0 | 0.0 | 274.791997 |
Time: 0:00:00.800094
data_agg_df_12_dataset = data_agg_12_dataset.as_data_frame()
data_agg_df_12_dataset["Name"] = data_agg_df_12_dataset["Name"].astype("category")
groups = data_agg_df_12_dataset.groupby("Name")
fig, ax = plt.subplots(1,1,figsize=(20,15))
for name, group in groups:
ax.plot(group.Runs, group.Sixes, marker='o', linestyle='', ms=7, label=name)
fig.suptitle("Aggregated Dataset", fontsize=20)
ax.legend(numpoints=1)
<matplotlib.legend.Legend at 0x18106904400>
data_agg_df_12_dataset.head()
| | Name | Runs | Ave | SR | Fours | Sixes | HF | counts |
|---|---|---|---|---|---|---|---|---|
| 0 | CH Gayle | 196.5 | 24.44 | 160.74 | 46 | 6 | 0.5 | 1 |
| 1 | G Gambhir | 590.0 | 36.87 | 143.55 | 16 | 17 | 0.5 | 1 |
| 2 | V Sehwag | 495.0 | 33.00 | 161.23 | 57 | 19 | 5.0 | 1 |
| 3 | CL White | 479.0 | 43.54 | 149.68 | 41 | 20 | 5.0 | 1 |
| 4 | S Dhawan | 569.0 | 40.64 | 129.61 | 58 | 18 | 5.0 | 1 |
b1= data_agg_df_12_dataset.groupby('Name')['Runs'].sum().sort_values(ascending = False ).head(10)
b2= data_agg_df_12_dataset.groupby('Name')['Ave'].sum().sort_values(ascending = False ).head(10)
b3= data_agg_df_12_dataset.groupby('Name')['SR'].sum().sort_values(ascending = False ).head(10)
b4= data_agg_df_12_dataset.groupby('Name')['Fours'].sum().sort_values(ascending = False ).head(10)
b5= data_agg_df_12_dataset.groupby('Name')['Sixes'].sum().sort_values(ascending = False ).head(10)
b6= data_agg_df_12_dataset.groupby('Name')['HF'].sum().sort_values(ascending = False ).head(10)
print(b1.to_string())
print(b2.to_string())
print(b3.to_string())
print(b4.to_string())
print(b5.to_string())
print(b6.to_string())
#Top 10 players by Runs, Average, Strike Rate, Fours, Sixes and Half Centuries
Top 10 by Runs:
G Gambhir 590.0
S Dhawan 569.0
AM Rahane 560.0
V Sehwag 495.0
CL White 479.0
R Dravid 462.0
SK Raina 441.0
RG Sharma 433.0
Mandeep Singh 432.0
JH Kallis 409.0

Top 10 by Ave:
DJ Bravo 46.37
CL White 43.54
SR Watson 42.50
DB Das 42.00
S Dhawan 40.64
SPD Smith 40.22
AM Rahane 40.00
AB de Villiers 39.87
DR Smith 39.25
OA Shah 37.77

Top 10 by SR:
DA Warner 164.10
V Sehwag 161.23
AB de Villiers 161.11
CH Gayle 160.74
DR Smith 160.20
JA Morkel 157.35
SR Watson 151.78
CL White 149.68
KP Pietersen 147.34
G Gambhir 143.55

Top 10 by Fours:
S Dhawan 58
V Sehwag 57
Mandeep Singh 53
CH Gayle 46
CL White 41
M Vijay 39
SR Tendulkar 39
RG Sharma 39
DMD Jayawardene 39
SE Marsh 39

Top 10 by Sixes:
DJ Bravo 20
CL White 20
KP Pietersen 20
SK Raina 19
V Sehwag 19
RG Sharma 18
S Dhawan 18
F du Plessis 17
G Gambhir 17
DJ Hussey 17

Top 10 by HF:
S Dhawan 5.0
RG Sharma 5.0
CL White 5.0
V Sehwag 5.0
AM Rahane 5.0
KP Pietersen 3.0
DA Warner 3.0
F du Plessis 3.0
TM Dilshan 3.0
DMD Jayawardene 3.0
Dimensionality reduction techniques that can be implemented using Python:
1. Missing Value Ratio
2. Low Variance Filter (see the sketch after this list)
3. High Correlation Filter
4. Random Forest
5. Backward Feature Elimination
6. Forward Feature Selection
7. Factor Analysis
8. Principal Component Analysis
9. Independent Component Analysis
10. Methods Based on Projections
11. t-Distributed Stochastic Neighbor Embedding (t-SNE)
12. UMAP
13. Autoencoder
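As an illustration of the second technique in this list, a low variance filter takes only a few lines with scikit-learn. A minimal sketch; the threshold value is a hypothetical choice and X stands for any numeric feature matrix:
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold = 0.1)        # drop features with variance below 0.1
X_reduced = selector.fit_transform(X)                # X: any numeric feature matrix
kept_columns = selector.get_support(indices = True)  # indices of the surviving features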
Dimensionality reduction on image data using the simplest possible autoencoder
import keras
from keras import layers

# Size of the bottleneck representation: 784 pixels compressed to 32 floats
encoding_dim = 32

input_img = keras.Input(shape=(784,))
# Encoder: one dense layer squeezes the flattened image into the bottleneck
encoded = layers.Dense(encoding_dim, activation='relu')(input_img)
# Decoder: one dense layer reconstructs the 784-pixel image
decoded = layers.Dense(784, activation='sigmoid')(encoded)

autoencoder = keras.Model(input_img, decoded)   # input -> reconstruction
encoder = keras.Model(input_img, encoded)       # input -> 32-dim code

# Standalone decoder: reuse the autoencoder's last layer on a fresh input
encoded_input = keras.Input(shape=(encoding_dim,))
decoder_layer = autoencoder.layers[-1]
decoder = keras.Model(encoded_input, decoder_layer(encoded_input))

# Binary cross-entropy treats each pixel in [0, 1] as an independent probability
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
from keras.datasets import mnist
import numpy as np

# Labels are discarded: the autoencoder trains against the images themselves
(x_train, _), (x_test, _) = mnist.load_data()

# Scale pixels to [0, 1] and flatten each 28x28 image into a 784-vector
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
print(x_train.shape)
print(x_test.shape)
(60000, 784)
(10000, 784)
# Train the autoencoder to reproduce its own input; test data only monitors val_loss
autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))
Epoch 1/50 - 15s 52ms/step - loss: 0.2764 - val_loss: 0.1876
Epoch 2/50 - 11s 49ms/step - loss: 0.1702 - val_loss: 0.1542
Epoch 3/50 - 11s 49ms/step - loss: 0.1451 - val_loss: 0.1348
Epoch 4/50 - 11s 49ms/step - loss: 0.1294 - val_loss: 0.1219
Epoch 5/50 - 12s 50ms/step - loss: 0.1185 - val_loss: 0.1130
Epoch 6/50 - 11s 48ms/step - loss: 0.1110 - val_loss: 0.1070
Epoch 7/50 - 11s 48ms/step - loss: 0.1058 - val_loss: 0.1026
Epoch 8/50 - 13s 54ms/step - loss: 0.1022 - val_loss: 0.0995
Epoch 9/50 - 11s 47ms/step - loss: 0.0995 - val_loss: 0.0972
Epoch 10/50 - 11s 47ms/step - loss: 0.0975 - val_loss: 0.0955
Epoch 11/50 - 11s 47ms/step - loss: 0.0962 - val_loss: 0.0945
Epoch 12/50 - 11s 49ms/step - loss: 0.0954 - val_loss: 0.0939
Epoch 13/50 - 11s 49ms/step - loss: 0.0948 - val_loss: 0.0933
Epoch 14/50 - 11s 48ms/step - loss: 0.0944 - val_loss: 0.0930
Epoch 15/50 - 12s 52ms/step - loss: 0.0942 - val_loss: 0.0928
Epoch 16/50 - 11s 47ms/step - loss: 0.0940 - val_loss: 0.0926
Epoch 17/50 - 10s 44ms/step - loss: 0.0938 - val_loss: 0.0925
Epoch 18/50 - 12s 49ms/step - loss: 0.0937 - val_loss: 0.0924
Epoch 19/50 - 12s 52ms/step - loss: 0.0935 - val_loss: 0.0922
Epoch 20/50 - 13s 57ms/step - loss: 0.0935 - val_loss: 0.0923
Epoch 21/50 - 11s 45ms/step - loss: 0.0934 - val_loss: 0.0921
Epoch 22/50 - 12s 49ms/step - loss: 0.0933 - val_loss: 0.0922
Epoch 23/50 - 13s 54ms/step - loss: 0.0932 - val_loss: 0.0920
Epoch 24/50 - 11s 48ms/step - loss: 0.0932 - val_loss: 0.0920
Epoch 25/50 - 11s 48ms/step - loss: 0.0931 - val_loss: 0.0920
Epoch 26/50 - 11s 48ms/step - loss: 0.0931 - val_loss: 0.0919
Epoch 27/50 - 12s 50ms/step - loss: 0.0930 - val_loss: 0.0921
Epoch 28/50 - 10s 44ms/step - loss: 0.0930 - val_loss: 0.0918
Epoch 29/50 - 12s 52ms/step - loss: 0.0930 - val_loss: 0.0919
Epoch 30/50 - 11s 48ms/step - loss: 0.0929 - val_loss: 0.0917
Epoch 31/50 - 12s 50ms/step - loss: 0.0929 - val_loss: 0.0918
Epoch 32/50 - 11s 49ms/step - loss: 0.0929 - val_loss: 0.0918
Epoch 33/50 - 11s 47ms/step - loss: 0.0929 - val_loss: 0.0918
Epoch 34/50 - 11s 47ms/step - loss: 0.0928 - val_loss: 0.0918
Epoch 35/50 - 11s 46ms/step - loss: 0.0928 - val_loss: 0.0917
Epoch 36/50 - 11s 48ms/step - loss: 0.0928 - val_loss: 0.0917
Epoch 37/50 - 11s 48ms/step - loss: 0.0928 - val_loss: 0.0916
Epoch 38/50 - 11s 46ms/step - loss: 0.0928 - val_loss: 0.0916
Epoch 39/50 - 11s 48ms/step - loss: 0.0928 - val_loss: 0.0916
Epoch 40/50 - 12s 52ms/step - loss: 0.0927 - val_loss: 0.0916
Epoch 41/50 - 10s 44ms/step - loss: 0.0927 - val_loss: 0.0917
Epoch 42/50 - 10s 45ms/step - loss: 0.0927 - val_loss: 0.0916
Epoch 43/50 - 11s 48ms/step - loss: 0.0927 - val_loss: 0.0916
Epoch 44/50 - 11s 45ms/step - loss: 0.0927 - val_loss: 0.0915
Epoch 45/50 - 12s 49ms/step - loss: 0.0927 - val_loss: 0.0915
Epoch 46/50 - 11s 48ms/step - loss: 0.0927 - val_loss: 0.0915
Epoch 47/50 - 12s 51ms/step - loss: 0.0927 - val_loss: 0.0916
Epoch 48/50 - 10s 43ms/step - loss: 0.0927 - val_loss: 0.0915
Epoch 49/50 - 11s 47ms/step - loss: 0.0926 - val_loss: 0.0915
Epoch 50/50 - 11s 48ms/step - loss: 0.0926 - val_loss: 0.0915
<keras.callbacks.History at 0x18103bfe130>
# Encode the test images to 32-dim codes, then reconstruct them
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder.predict(encoded_imgs)
import matplotlib.pyplot as plt

n = 10  # number of digits to display
plt.figure(figsize=(20, 4))
for i in range(n):
    # Top row: original test images
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # Bottom row: reconstructions decoded from the 32-dim codes
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
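For reference, a linear baseline with the same bottleneck width can be obtained from PCA. A minimal sketch comparing reconstruction error, assuming the x_train, x_test and decoded_imgs arrays from the cells above are still in scope:
import numpy as np
from sklearn.decomposition import PCA

# Project to 32 components and back, mirroring the autoencoder's 784 -> 32 -> 784 path
pca = PCA(n_components=32).fit(x_train)
pca_recon = pca.inverse_transform(pca.transform(x_test))

mse_pca = np.mean((x_test - pca_recon) ** 2)
mse_ae = np.mean((x_test - decoded_imgs) ** 2)
print(f"PCA reconstruction MSE: {mse_pca:.4f}, autoencoder MSE: {mse_ae:.4f}")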